public inbox for [email protected]
help / color / mirror / Atom feedRe: Support LIKE with nondeterministic collations
13+ messages / 6 participants
[nested] [flat]
* Re: Support LIKE with nondeterministic collations
@ 2024-04-30 12:39 Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-06-28 06:31 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
0 siblings, 2 replies; 13+ messages in thread
From: Daniel Verite @ 2024-04-30 12:39 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: pgsql-hackers
Peter Eisentraut wrote:
> This patch adds support for using LIKE with nondeterministic
> collations. So you can do things such as
>
> col LIKE 'foo%' COLLATE case_insensitive
Nice!
> The pattern is partitioned into substrings at wildcard characters
> (so 'foo%bar' is partitioned into 'foo', '%', 'bar') and then then
> whole predicate matches if a match can be found for each partition
> under the applicable collation
Trying with a collation that ignores punctuation:
postgres=# CREATE COLLATION "ign_punct" (
provider = 'icu',
locale='und-u-ka-shifted',
deterministic = false
);
postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
?column?
----------
t
(1 row)
postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
?column?
----------
t
(1 row)
postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)
The first two results look fine, but the next one is inconsistent.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
@ 2024-05-02 13:38 ` Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
1 sibling, 1 reply; 13+ messages in thread
From: Peter Eisentraut @ 2024-05-02 13:38 UTC (permalink / raw)
To: Daniel Verite <[email protected]>; +Cc: pgsql-hackers
On 30.04.24 14:39, Daniel Verite wrote:
> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
> ?column?
> ----------
> f
> (1 row)
>
> The first two results look fine, but the next one is inconsistent.
This is correct, because '_' means "any single character". This is
independent of the collation.
I think with nondeterministic collations, the single-character wildcard
is often not going to be all that useful.
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
@ 2024-05-03 00:11 ` Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Robert Haas @ 2024-05-03 00:11 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers
On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <[email protected]> wrote:
> On 30.04.24 14:39, Daniel Verite wrote:
> > postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
> > ?column?
> > ----------
> > f
> > (1 row)
> >
> > The first two results look fine, but the next one is inconsistent.
>
> This is correct, because '_' means "any single character". This is
> independent of the collation.
Seems really counterintuitive. I had to think for a long time to be
able to guess what was happening here. Finally I came up with this
guess:
If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f. But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.
Is that right?
It'd probably be good to use something like this as an example in the
documentation. My intuition is that if foo matches a string, then _oo
f_o and fo_ should also match that string. Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.
--
Robert Haas
EDB: http://www.enterprisedb.com
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
@ 2024-05-03 08:52 ` Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Peter Eisentraut @ 2024-05-03 08:52 UTC (permalink / raw)
To: Robert Haas <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers
On 03.05.24 02:11, Robert Haas wrote:
> On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <[email protected]> wrote:
>> On 30.04.24 14:39, Daniel Verite wrote:
>>> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>>> ?column?
>>> ----------
>>> f
>>> (1 row)
>>>
>>> The first two results look fine, but the next one is inconsistent.
>>
>> This is correct, because '_' means "any single character". This is
>> independent of the collation.
>
> Seems really counterintuitive. I had to think for a long time to be
> able to guess what was happening here. Finally I came up with this
> guess:
>
> If the collation-aware matching tries to match up f with the initial
> period, the period is skipped and the f matches f. But when the
> wildcard is matched to the initial period, that uses up the wildcard
> and then we're left trying to match o with f, which doesn't work.
Formally, what
X like '_oo'
means is, can X be partitioned into substrings such that the first
substring is a single character and the second substring is equal to
'oo' under the applicable collation? This is false in this case, there
is no such partitioning.
What the implementation does is, it walks through the pattern. It sees
'_', so it steps over one character in the input string, which is '.'
here. Then we have 'foo.' left to match in the input string. Then it
takes from the pattern the next substring up to but not including either
a wildcard character or the end of the string, which is 'oo', and then
it checks if a prefix of the remaining input string can be found that is
"equal to" 'oo'. So here it would try in turn
'' = 'oo' collate ign_punct ?
'f' = 'oo' collate ign_punct ?
'fo' = 'oo' collate ign_punct ?
'foo' = 'oo' collate ign_punct ?
'foo.' = 'oo' collate ign_punct ?
and they all fail, so the match fails.
> It'd probably be good to use something like this as an example in the
> documentation. My intuition is that if foo matches a string, then _oo
> f_o and fo_ should also match that string. Apparently that's not the
> case, but I doubt I'll be the last one who thinks it should be.
This intuition fails because with nondeterministic collations, strings
of different lengths can be equal, and so the question arises, what does
the pattern '_' mean. It could mean either, (1) a single character, or
perhaps something like, (2) a string that is equal to some other string
of length one.
The second definition would satisfy the expectation here, because then
'.f' matches '_' because '.f' is equal to some string of length one,
such as 'f'. (And then 'oo.' matches 'oo' for the rest of the pattern.)
However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.
In any case, yes, some explanation and examples should be added.
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
@ 2024-05-03 13:20 ` Robert Haas <[email protected]>
2024-05-03 14:12 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Robert Haas @ 2024-05-03 13:20 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers
On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <[email protected]> wrote:
> What the implementation does is, it walks through the pattern. It sees
> '_', so it steps over one character in the input string, which is '.'
> here. Then we have 'foo.' left to match in the input string. Then it
> takes from the pattern the next substring up to but not including either
> a wildcard character or the end of the string, which is 'oo', and then
> it checks if a prefix of the remaining input string can be found that is
> "equal to" 'oo'. So here it would try in turn
>
> '' = 'oo' collate ign_punct ?
> 'f' = 'oo' collate ign_punct ?
> 'fo' = 'oo' collate ign_punct ?
> 'foo' = 'oo' collate ign_punct ?
> 'foo.' = 'oo' collate ign_punct ?
>
> and they all fail, so the match fails.
Interesting. Does that imply that these matches are slower than normal ones?
> The second definition would satisfy the expectation here, because then
> '.f' matches '_' because '.f' is equal to some string of length one,
> such as 'f'. (And then 'oo.' matches 'oo' for the rest of the pattern.)
> However, off the top of my head, this definition has three flaws: (1)
> It would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing, (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings,
> and (3) it is not what the SQL standard says.
Right, those are good arguments.
> In any case, yes, some explanation and examples should be added.
Cool.
--
Robert Haas
EDB: http://www.enterprisedb.com
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
@ 2024-05-03 14:12 ` Peter Eisentraut <[email protected]>
2024-05-03 14:58 ` Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Peter Eisentraut @ 2024-05-03 14:12 UTC (permalink / raw)
To: Robert Haas <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers
On 03.05.24 15:20, Robert Haas wrote:
> On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <[email protected]> wrote:
>> What the implementation does is, it walks through the pattern. It sees
>> '_', so it steps over one character in the input string, which is '.'
>> here. Then we have 'foo.' left to match in the input string. Then it
>> takes from the pattern the next substring up to but not including either
>> a wildcard character or the end of the string, which is 'oo', and then
>> it checks if a prefix of the remaining input string can be found that is
>> "equal to" 'oo'. So here it would try in turn
>>
>> '' = 'oo' collate ign_punct ?
>> 'f' = 'oo' collate ign_punct ?
>> 'fo' = 'oo' collate ign_punct ?
>> 'foo' = 'oo' collate ign_punct ?
>> 'foo.' = 'oo' collate ign_punct ?
>>
>> and they all fail, so the match fails.
>
> Interesting. Does that imply that these matches are slower than normal ones?
Yes, certainly, and there is also no indexing support (other than for
exact matches).
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 14:12 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
@ 2024-05-03 14:58 ` Daniel Verite <[email protected]>
2024-05-03 18:53 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-07-31 22:26 ` Re: Support LIKE with nondeterministic collations Jeff Davis <[email protected]>
0 siblings, 2 replies; 13+ messages in thread
From: Daniel Verite @ 2024-05-03 14:58 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: Robert Haas <[email protected]>; pgsql-hackers
Peter Eisentraut wrote:
> Yes, certainly, and there is also no indexing support (other than for
> exact matches).
The ICU docs have this note about prefix matching:
https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort...
* Generating bounds for a sort key (prefix matching)
Having sort keys for strings allows for easy creation of bounds -
sort keys that are guaranteed to be smaller or larger than any sort
key from a give range. For example, if bounds are produced for a
sortkey of string “smith”, strings between upper and lower bounds
with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
of upper bounds can be generated - the first one will match only
strings of equal length, while the second one will match all the
strings with the same initial prefix.
CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
the maximum primary weight, so that for example the string
“smith\uFFFF” can be used as the upper bound rather than modifying
the sort key for “smith”.
In other words it says that
col LIKE 'smith%' collate "nd"
is equivalent to:
col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
which could be obtained from an index scan, assuming a btree
index on "col" collate "nd".
U+FFFF is a valid code point but a "non-character" [1] so it's
not supposed to be present in normal strings.
[1] https://www.unicode.org/glossary/#noncharacter
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 14:12 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 14:58 ` Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
@ 2024-05-03 18:53 ` Peter Eisentraut <[email protected]>
1 sibling, 0 replies; 13+ messages in thread
From: Peter Eisentraut @ 2024-05-03 18:53 UTC (permalink / raw)
To: Daniel Verite <[email protected]>; +Cc: Robert Haas <[email protected]>; pgsql-hackers
On 03.05.24 16:58, Daniel Verite wrote:
> * Generating bounds for a sort key (prefix matching)
>
> Having sort keys for strings allows for easy creation of bounds -
> sort keys that are guaranteed to be smaller or larger than any sort
> key from a give range. For example, if bounds are produced for a
> sortkey of string “smith”, strings between upper and lower bounds
> with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
> of upper bounds can be generated - the first one will match only
> strings of equal length, while the second one will match all the
> strings with the same initial prefix.
>
> CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
> the maximum primary weight, so that for example the string
> “smith\uFFFF” can be used as the upper bound rather than modifying
> the sort key for “smith”.
>
> In other words it says that
>
> col LIKE 'smith%' collate "nd"
>
> is equivalent to:
>
> col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
>
> which could be obtained from an index scan, assuming a btree
> index on "col" collate "nd".
>
> U+FFFF is a valid code point but a "non-character" [1] so it's
> not supposed to be present in normal strings.
Thanks, this could be very useful!
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 14:12 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 14:58 ` Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
@ 2024-07-31 22:26 ` Jeff Davis <[email protected]>
2026-05-18 11:51 ` Re: Support LIKE with nondeterministic collations Nico Williams <[email protected]>
1 sibling, 1 reply; 13+ messages in thread
From: Jeff Davis @ 2024-07-31 22:26 UTC (permalink / raw)
To: Daniel Verite <[email protected]>; Peter Eisentraut <[email protected]>; +Cc: Robert Haas <[email protected]>; pgsql-hackers
On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:
> * Generating bounds for a sort key (prefix matching)
>
> Having sort keys for strings allows for easy creation of bounds -
> sort keys that are guaranteed to be smaller or larger than any
> sort
> key from a give range. For example, if bounds are produced for a
> sortkey of string “smith”, strings between upper and lower bounds
> with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
> of upper bounds can be generated - the first one will match only
> strings of equal length, while the second one will match all the
> strings with the same initial prefix.
>
> CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
> the maximum primary weight, so that for example the string
> “smith\uFFFF” can be used as the upper bound rather than modifying
> the sort key for “smith”.
>
> In other words it says that
>
> col LIKE 'smith%' collate "nd"
>
> is equivalent to:
>
> col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
That logic seems to assume something about the collation. If you have a
collation that orders strings by their sha256 hash, that would entirely
break the connection between prefixes and ranges, and it wouldn't work.
Is there something about the way collations are defined that inherently
maintains a connection between a prefix and a range? Does it remain
true even when strange rules are added to a collation?
Regards,
Jeff Davis
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 08:52 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Re: Support LIKE with nondeterministic collations Robert Haas <[email protected]>
2024-05-03 14:12 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
2024-05-03 14:58 ` Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-07-31 22:26 ` Re: Support LIKE with nondeterministic collations Jeff Davis <[email protected]>
@ 2026-05-18 11:51 ` Nico Williams <[email protected]>
0 siblings, 0 replies; 13+ messages in thread
From: Nico Williams @ 2026-05-18 11:51 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; +Cc: Daniel Verite <[email protected]>; Peter Eisentraut <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers
On Wed, Jul 31, 2024 at 03:26:34PM -0700, Jeff Davis wrote:
> On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:
> > In other words it says that
> >
> > col LIKE 'smith%' collate "nd"
> >
> > is equivalent to:
> >
> > col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
>
> That logic seems to assume something about the collation. If you have a
> collation that orders strings by their sha256 hash, that would entirely
> break the connection between prefixes and ranges, and it wouldn't work.
The hash of what? each character's names or canonical representations in
some UTF? If so, then, to maintain the above equivalence one would have
to alter the definition of this 'hash-based collation' so that U+FFFF is
always "last".
> Is there something about the way collations are defined that inherently
> maintains a connection between a prefix and a range? [...]
Yes: rules like the one Daniel described.
> [...]? Does it remain
> true even when strange rules are added to a collation?
There are 'strange rules' which cannot be used in defining a collation,
as the result would not then be a collation.
Nico
--
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
@ 2024-06-28 06:31 ` Peter Eisentraut <[email protected]>
2026-05-18 10:23 ` Re: Support LIKE with nondeterministic collations Jelte Fennema-Nio <[email protected]>
1 sibling, 1 reply; 13+ messages in thread
From: Peter Eisentraut @ 2024-06-28 06:31 UTC (permalink / raw)
To: pgsql-hackers; +Cc: Daniel Verite <[email protected]>
Here is an updated patch for this.
I have added some more documentation based on the discussions, including
some examples taken directly from the emails here.
One thing I have been struggling with a bit is the correct use of
LIKE_FALSE versus LIKE_ABORT in the MatchText() code. I have made some
small tweaks about this in this version that I think are more correct,
but it could use another look. Maybe also some more tests to verify
this one way or the other.
On 30.04.24 14:39, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> This patch adds support for using LIKE with nondeterministic
>> collations. So you can do things such as
>>
>> col LIKE 'foo%' COLLATE case_insensitive
>
> Nice!
>
>> The pattern is partitioned into substrings at wildcard characters
>> (so 'foo%bar' is partitioned into 'foo', '%', 'bar') and then then
>> whole predicate matches if a match can be found for each partition
>> under the applicable collation
>
> Trying with a collation that ignores punctuation:
>
> postgres=# CREATE COLLATION "ign_punct" (
> provider = 'icu',
> locale='und-u-ka-shifted',
> deterministic = false
> );
>
> postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
> ?column?
> ----------
> t
> (1 row)
>
> postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
> ?column?
> ----------
> t
> (1 row)
>
> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
> ?column?
> ----------
> f
> (1 row)
>
> The first two results look fine, but the next one is inconsistent.
From 34f5bb1e8f0ffbb39b1efc9777736f6b4d6c4caa Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <[email protected]>
Date: Fri, 28 Jun 2024 06:55:45 +0200
Subject: [PATCH v2] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error. This adds the internal implementation and
removes the error.
Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring. For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'. This is because the collation could
consider substrings of different lengths to be equal. This is all
internal to MatchText() in like_match.c.
The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.
Note that ILIKE is not affected. It's unclear whether ILIKE makes
sense under nondeterministic collations.
This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations. This was already alluded to in the previous code.
(includes documentation examples from Daniel Vérité)
Discussion: https://www.postgresql.org/message-id/flat/[email protected]
---
doc/src/sgml/charset.sgml | 2 +-
doc/src/sgml/func.sgml | 51 +++++++-
src/backend/utils/adt/like.c | 30 +++--
src/backend/utils/adt/like_match.c | 118 ++++++++++++++++++
src/backend/utils/adt/like_support.c | 29 ++---
.../regress/expected/collate.icu.utf8.out | 81 ++++++++++--
src/test/regress/sql/collate.icu.utf8.sql | 23 +++-
7 files changed, 292 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
to a performance penalty. Note, in particular, that B-tree cannot use
deduplication with indexes that use a nondeterministic collation. Also,
certain operations are not possible with nondeterministic collations,
- such as pattern matching operations. Therefore, they should be used
+ such as some pattern matching operations. Therefore, they should be used
only in cases where they are specifically wanted.
</para>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2609269610b..833db120cb3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5374,9 +5374,10 @@ <title>Pattern Matching</title>
</caution>
<para>
- The pattern matching operators of all three kinds do not support
- nondeterministic collations. If required, apply a different collation to
- the expression to work around this limitation.
+ <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+ expressions do not support nondeterministic collations. If required, use
+ <function>LIKE</function> or apply a different collation to the expression
+ to work around this limitation.
</para>
<sect2 id="functions-like">
@@ -5422,6 +5423,45 @@ <title><function>LIKE</function></title>
</programlisting>
</para>
+ <para>
+ <function>LIKE</function> pattern matching supports nondeterministic
+ collations (see <xref linkend="collation-nondeterministic"/>), such as
+ case-insensitive collations or collations that, say, ignore punctuation.
+ So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive <lineannotation>true</lineannotation>
+</programlisting>
+ With collations that ignore certain characters or in general that consider
+ strings of different lengths equal, the semantics can become a bit more
+ complicated. Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct <lineannotation>false</lineannotation>
+</programlisting>
+ The way the matching works is that the pattern is partitioned into
+ sequences of wildcards and non-wildcard strings. For example, the pattern
+ <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+ pattern <literal>_oo</literal> is partitioned into <literal>_,
+ oo</literal>. The input string matches the pattern if it can be
+ partitioned in such a way that the wildcards match one character or any
+ number of characters respectively and the non-wildcard partitions are
+ equal under the applicable collation. So for example, <literal>'.foo.'
+ LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+ <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+ <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+ matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+ ign_punct</literal>. But <literal>'.foo.' LIKE '_oo' COLLATE
+ ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+ partitioned in a way that the first character is any character and the
+ rest of the string compares equal to <literal>oo</literal>. (Note that
+ the single-character wildcard always matches exactly one character,
+ independent of the collation. So in this example, the
+ <literal>_</literal> would match <literal>.</literal>, but then the rest
+ of the input string won't match the rest of the pattern.)
+ </para>
+
<para>
<function>LIKE</function> pattern matching always covers the entire
string. Therefore, if it's desired to match a sequence anywhere within
@@ -5463,8 +5503,9 @@ <title><function>LIKE</function></title>
<para>
The key word <token>ILIKE</token> can be used instead of
- <token>LIKE</token> to make the match case-insensitive according
- to the active locale. This is not in the <acronym>SQL</acronym> standard but is a
+ <token>LIKE</token> to make the match case-insensitive according to the
+ active locale. (But this does not support nondeterministic collations.)
+ This is not in the <acronym>SQL</acronym> standard but is a
<productname>PostgreSQL</productname> extension.
</para>
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
static inline int
GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
{
- if (collation && !lc_ctype_is_c(collation))
- {
- pg_locale_t locale = pg_newlocale_from_collation(collation);
+ pg_locale_t locale = 0;
+ bool locale_is_c = false;
- if (!pg_locale_deterministic(locale))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for LIKE")));
+ if (!OidIsValid(collation))
+ {
+ /*
+ * This typically means that the parser could not resolve a conflict
+ * of implicit collations, so report it that way.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_INDETERMINATE_COLLATION),
+ errmsg("could not determine which collation to use for LIKE"),
+ errhint("Use the COLLATE clause to set the collation explicitly.")));
}
+ if (lc_ctype_is_c(collation))
+ locale_is_c = true;
+ else
+ locale = pg_newlocale_from_collation(collation);
+
if (pg_database_encoding_max_length() == 1)
- return SB_MatchText(s, slen, p, plen, 0, true);
+ return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
else if (GetDatabaseEncoding() == PG_UTF8)
- return UTF8_MatchText(s, slen, p, plen, 0, true);
+ return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
else
- return MB_MatchText(s, slen, p, plen, 0, true);
+ return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
}
static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..0751b329b8e 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
NextByte(p, plen);
continue;
}
+ else if (locale && !locale->deterministic)
+ {
+ /*
+ * For nondeterministic locales, we find the next substring of the
+ * pattern that does not contain wildcards and try to find a
+ * matching substring in the text. Crucially, we cannot do this
+ * character by character, as in the normal case, but must do it
+ * substring by substring, partitioned by the wildcard characters.
+ */
+ const char *p1;
+ size_t p1len;
+ const char *t1;
+ size_t t1len;
+ bool found_escape;
+ const char *subpat;
+ size_t subpatlen;
+ char *buf = NULL;
+
+ /*
+ * Determine next substring of pattern without wildcards. p is
+ * the start of the subpattern, p1 is one past the last byte. Also
+ * track if we found an escape character.
+ */
+ p1 = p;
+ p1len = plen;
+ found_escape = false;
+ while (p1len > 0)
+ {
+ if (*p1 == '\\')
+ {
+ found_escape = true;
+ NextByte(p1, p1len);
+ }
+ else if (*p1 == '_' || *p1 == '%')
+ break;
+ NextByte(p1, p1len);
+ }
+
+ /*
+ * If we found an escape character, then make an unescaped copy of
+ * the subpattern.
+ */
+ if (found_escape)
+ {
+ char *b;
+
+ b = buf = palloc(p1 - p);
+ for (const char *c = p; c < p1; c++)
+ {
+ if (*c == '\\')
+ ;
+ else
+ *(b++) = *c;
+ }
+
+ subpat = buf;
+ subpatlen = b - buf;
+ }
+ else
+ {
+ subpat = p;
+ subpatlen = p1 - p;
+ }
+
+ /*
+ * Now build a substring of the text and try to match it against
+ * the subpattern. t is the start of the text, t1 is one past the
+ * last byte. We start with a zero-length string.
+ */
+ t1 = t;
+ t1len = tlen;
+ for (;;)
+ {
+ int cmp;
+
+ cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+ /*
+ * If we found a match, we have to test if the rest of pattern
+ * can match against the rest of the string. Otherwise we
+ * have to continue here try matching with a longer substring.
+ * (This is similar to the recursion for the '%' wildcard
+ * above.)
+ *
+ * Note that we can't just wind forward p and t and continue
+ * with the main loop. This would fail for example with
+ *
+ * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+ *
+ * You'd find that t=\0061 matches p=\00E4, but then the rest
+ * won't match; but t=\0061\0308 also matches p=\00E4, and
+ * then the rest will match.
+ */
+ if (cmp == 0)
+ {
+ int matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+ if (matched == LIKE_TRUE)
+ {
+ if (buf)
+ pfree(buf);
+ return matched;
+ }
+ }
+
+ /*
+ * Didn't match. If we used up the whole text, then the match
+ * fails. Otherwise, try again with a longer substring.
+ */
+ if (t1len == 0)
+ return LIKE_FALSE;
+ else
+ NextChar(t1, t1len);
+ }
+ if (buf)
+ pfree(buf);
+ continue;
+ }
else if (GETCHAR(*p) != GETCHAR(*t))
{
/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
return NIL;
patt = (Const *) rightop;
- /*
- * Not supported if the expression collation is nondeterministic. The
- * optimized equality or prefix tests use bytewise comparisons, which is
- * not consistent with nondeterministic collations. The actual
- * pattern-matching implementation functions will later error out that
- * pattern-matching is not supported with nondeterministic collations. (We
- * could also error out here, but by doing it later we get more precise
- * error messages.) (It should be possible to support at least
- * Pattern_Prefix_Exact, but no point as long as the actual
- * pattern-matching implementations don't support it.)
- *
- * expr_coll is not set for a non-collation-aware data type such as bytea.
- */
- if (expr_coll && !get_collation_isdeterministic(expr_coll))
- return NIL;
-
/*
* Try to extract a fixed prefix from the pattern.
*/
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
{
if (!op_in_opfamily(eqopr, opfamily))
return NIL;
+ if (indexcollation != expr_coll)
+ return NIL;
expr = make_opclause(eqopr, BOOLOID, false,
(Expr *) leftop, (Expr *) prefix,
InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
return result;
}
+ /*
+ * Anything other than Pattern_Prefix_Exact is not supported if the
+ * expression collation is nondeterministic. The optimized equality or
+ * prefix tests use bytewise comparisons, which is not consistent with
+ * nondeterministic collations.
+ *
+ * expr_coll is not set for a non-collation-aware data type such as bytea.
+ */
+ if (expr_coll && !get_collation_isdeterministic(expr_coll))
+ return NIL;
+
/*
* Otherwise, we have a nonempty required prefix of the values. Some
* opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7d59fb44316..8bfdf33c2fa 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
NOTICE: using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
2 | äbc
(2 rows)
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a | b
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a | b
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
-- same with arrays
CREATE TABLE test6a (a int, b text[]);
INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
(2 rows)
SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR: nondeterministic collations are not supported for LIKE
+ x
+-----
+ abc
+ ABC
+(2 rows)
+
SELECT x FROM test3ci WHERE x ILIKE 'a%';
ERROR: nondeterministic collations are not supported for ILIKE
SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
(2 rows)
SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR: nondeterministic collations are not supported for LIKE
+ x
+-----
+ abc
+ ABC
+(2 rows)
+
SELECT x FROM test3bpci WHERE x ILIKE 'a%';
ERROR: nondeterministic collations are not supported for ILIKE
SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
-- This tests the issue described in match_pattern_prefix(). In the
-- absence of that check, the case_insensitive tests below would
-- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
INSERT INTO test4c VALUES ('abc');
CREATE INDEX ON test4c (x);
SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive; -- ok, no rows
---
(0 rows)
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- error
-ERROR: nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- error
-ERROR: nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- ok
+ x
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- ok
+ x
+-----
+ abc
+(1 row)
+
RESET enable_seqscan;
-- Unicode special case: different variants of Greek lower case sigma.
-- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,18 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
1 | cote
(1 row)
+-- This is a tricky one. A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore. Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column?
+----------
+ t
+(1 row)
+
-- foreign keys (should use collation of primary key)
-- PK is case-sensitive, FK is case-insensitive
CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..481a995c998 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
-- same with arrays
CREATE TABLE test6a (a int, b text[]);
INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x); -- error
-- This tests the issue described in match_pattern_prefix(). In the
-- absence of that check, the case_insensitive tests below would
-- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
INSERT INTO test4c VALUES ('abc');
CREATE INDEX ON test4c (x);
SET enable_seqscan = off;
SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive; -- ok, no rows
SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive; -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- ok
RESET enable_seqscan;
-- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,14 @@ CREATE TABLE test4 (a int, b text);
SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents; -- still case-sensitive
SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
+-- This is a tricky one. A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore. Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+
-- foreign keys (should use collation of primary key)
-- PK is case-sensitive, FK is case-insensitive
base-commit: 3e53492aa7084bceaa92757c90e067d79768797e
--
2.45.2
Attachments:
[text/plain] v2-0001-Support-LIKE-with-nondeterministic-collations.patch (21.0K, 2-v2-0001-Support-LIKE-with-nondeterministic-collations.patch)
download | inline diff:
From 34f5bb1e8f0ffbb39b1efc9777736f6b4d6c4caa Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <[email protected]>
Date: Fri, 28 Jun 2024 06:55:45 +0200
Subject: [PATCH v2] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error. This adds the internal implementation and
removes the error.
Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring. For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'. This is because the collation could
consider substrings of different lengths to be equal. This is all
internal to MatchText() in like_match.c.
The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.
Note that ILIKE is not affected. It's unclear whether ILIKE makes
sense under nondeterministic collations.
This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations. This was already alluded to in the previous code.
(includes documentation examples from Daniel Vérité)
Discussion: https://www.postgresql.org/message-id/flat/[email protected]
---
doc/src/sgml/charset.sgml | 2 +-
doc/src/sgml/func.sgml | 51 +++++++-
src/backend/utils/adt/like.c | 30 +++--
src/backend/utils/adt/like_match.c | 118 ++++++++++++++++++
src/backend/utils/adt/like_support.c | 29 ++---
.../regress/expected/collate.icu.utf8.out | 81 ++++++++++--
src/test/regress/sql/collate.icu.utf8.sql | 23 +++-
7 files changed, 292 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
to a performance penalty. Note, in particular, that B-tree cannot use
deduplication with indexes that use a nondeterministic collation. Also,
certain operations are not possible with nondeterministic collations,
- such as pattern matching operations. Therefore, they should be used
+ such as some pattern matching operations. Therefore, they should be used
only in cases where they are specifically wanted.
</para>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2609269610b..833db120cb3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5374,9 +5374,10 @@ <title>Pattern Matching</title>
</caution>
<para>
- The pattern matching operators of all three kinds do not support
- nondeterministic collations. If required, apply a different collation to
- the expression to work around this limitation.
+ <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+ expressions do not support nondeterministic collations. If required, use
+ <function>LIKE</function> or apply a different collation to the expression
+ to work around this limitation.
</para>
<sect2 id="functions-like">
@@ -5422,6 +5423,45 @@ <title><function>LIKE</function></title>
</programlisting>
</para>
+ <para>
+ <function>LIKE</function> pattern matching supports nondeterministic
+ collations (see <xref linkend="collation-nondeterministic"/>), such as
+ case-insensitive collations or collations that, say, ignore punctuation.
+ So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive <lineannotation>true</lineannotation>
+</programlisting>
+ With collations that ignore certain characters or in general that consider
+ strings of different lengths equal, the semantics can become a bit more
+ complicated. Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct <lineannotation>false</lineannotation>
+</programlisting>
+ The way the matching works is that the pattern is partitioned into
+ sequences of wildcards and non-wildcard strings. For example, the pattern
+ <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+ pattern <literal>_oo</literal> is partitioned into <literal>_,
+ oo</literal>. The input string matches the pattern if it can be
+ partitioned in such a way that the wildcards match one character or any
+ number of characters respectively and the non-wildcard partitions are
+ equal under the applicable collation. So for example, <literal>'.foo.'
+ LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+ <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+ <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+ matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+ ign_punct</literal>. But <literal>'.foo.' LIKE '_oo' COLLATE
+ ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+ partitioned in a way that the first character is any character and the
+ rest of the string compares equal to <literal>oo</literal>. (Note that
+ the single-character wildcard always matches exactly one character,
+ independent of the collation. So in this example, the
+ <literal>_</literal> would match <literal>.</literal>, but then the rest
+ of the input string won't match the rest of the pattern.)
+ </para>
+
<para>
<function>LIKE</function> pattern matching always covers the entire
string. Therefore, if it's desired to match a sequence anywhere within
@@ -5463,8 +5503,9 @@ <title><function>LIKE</function></title>
<para>
The key word <token>ILIKE</token> can be used instead of
- <token>LIKE</token> to make the match case-insensitive according
- to the active locale. This is not in the <acronym>SQL</acronym> standard but is a
+ <token>LIKE</token> to make the match case-insensitive according to the
+ active locale. (But this does not support nondeterministic collations.)
+ This is not in the <acronym>SQL</acronym> standard but is a
<productname>PostgreSQL</productname> extension.
</para>
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
static inline int
GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
{
- if (collation && !lc_ctype_is_c(collation))
- {
- pg_locale_t locale = pg_newlocale_from_collation(collation);
+ pg_locale_t locale = 0;
+ bool locale_is_c = false;
- if (!pg_locale_deterministic(locale))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for LIKE")));
+ if (!OidIsValid(collation))
+ {
+ /*
+ * This typically means that the parser could not resolve a conflict
+ * of implicit collations, so report it that way.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_INDETERMINATE_COLLATION),
+ errmsg("could not determine which collation to use for LIKE"),
+ errhint("Use the COLLATE clause to set the collation explicitly.")));
}
+ if (lc_ctype_is_c(collation))
+ locale_is_c = true;
+ else
+ locale = pg_newlocale_from_collation(collation);
+
if (pg_database_encoding_max_length() == 1)
- return SB_MatchText(s, slen, p, plen, 0, true);
+ return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
else if (GetDatabaseEncoding() == PG_UTF8)
- return UTF8_MatchText(s, slen, p, plen, 0, true);
+ return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
else
- return MB_MatchText(s, slen, p, plen, 0, true);
+ return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
}
static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..0751b329b8e 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
NextByte(p, plen);
continue;
}
+ else if (locale && !locale->deterministic)
+ {
+ /*
+ * For nondeterministic locales, we find the next substring of the
+ * pattern that does not contain wildcards and try to find a
+ * matching substring in the text. Crucially, we cannot do this
+ * character by character, as in the normal case, but must do it
+ * substring by substring, partitioned by the wildcard characters.
+ */
+ const char *p1;
+ size_t p1len;
+ const char *t1;
+ size_t t1len;
+ bool found_escape;
+ const char *subpat;
+ size_t subpatlen;
+ char *buf = NULL;
+
+ /*
+ * Determine next substring of pattern without wildcards. p is
+ * the start of the subpattern, p1 is one past the last byte. Also
+ * track if we found an escape character.
+ */
+ p1 = p;
+ p1len = plen;
+ found_escape = false;
+ while (p1len > 0)
+ {
+ if (*p1 == '\\')
+ {
+ found_escape = true;
+ NextByte(p1, p1len);
+ }
+ else if (*p1 == '_' || *p1 == '%')
+ break;
+ NextByte(p1, p1len);
+ }
+
+ /*
+ * If we found an escape character, then make an unescaped copy of
+ * the subpattern.
+ */
+ if (found_escape)
+ {
+ char *b;
+
+ b = buf = palloc(p1 - p);
+ for (const char *c = p; c < p1; c++)
+ {
+ if (*c == '\\')
+ ;
+ else
+ *(b++) = *c;
+ }
+
+ subpat = buf;
+ subpatlen = b - buf;
+ }
+ else
+ {
+ subpat = p;
+ subpatlen = p1 - p;
+ }
+
+ /*
+ * Now build a substring of the text and try to match it against
+ * the subpattern. t is the start of the text, t1 is one past the
+ * last byte. We start with a zero-length string.
+ */
+ t1 = t;
+ t1len = tlen;
+ for (;;)
+ {
+ int cmp;
+
+ cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+ /*
+ * If we found a match, we have to test if the rest of pattern
+ * can match against the rest of the string. Otherwise we
+ * have to continue here try matching with a longer substring.
+ * (This is similar to the recursion for the '%' wildcard
+ * above.)
+ *
+ * Note that we can't just wind forward p and t and continue
+ * with the main loop. This would fail for example with
+ *
+ * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+ *
+ * You'd find that t=\0061 matches p=\00E4, but then the rest
+ * won't match; but t=\0061\0308 also matches p=\00E4, and
+ * then the rest will match.
+ */
+ if (cmp == 0)
+ {
+ int matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+ if (matched == LIKE_TRUE)
+ {
+ if (buf)
+ pfree(buf);
+ return matched;
+ }
+ }
+
+ /*
+ * Didn't match. If we used up the whole text, then the match
+ * fails. Otherwise, try again with a longer substring.
+ */
+ if (t1len == 0)
+ return LIKE_FALSE;
+ else
+ NextChar(t1, t1len);
+ }
+ if (buf)
+ pfree(buf);
+ continue;
+ }
else if (GETCHAR(*p) != GETCHAR(*t))
{
/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
return NIL;
patt = (Const *) rightop;
- /*
- * Not supported if the expression collation is nondeterministic. The
- * optimized equality or prefix tests use bytewise comparisons, which is
- * not consistent with nondeterministic collations. The actual
- * pattern-matching implementation functions will later error out that
- * pattern-matching is not supported with nondeterministic collations. (We
- * could also error out here, but by doing it later we get more precise
- * error messages.) (It should be possible to support at least
- * Pattern_Prefix_Exact, but no point as long as the actual
- * pattern-matching implementations don't support it.)
- *
- * expr_coll is not set for a non-collation-aware data type such as bytea.
- */
- if (expr_coll && !get_collation_isdeterministic(expr_coll))
- return NIL;
-
/*
* Try to extract a fixed prefix from the pattern.
*/
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
{
if (!op_in_opfamily(eqopr, opfamily))
return NIL;
+ if (indexcollation != expr_coll)
+ return NIL;
expr = make_opclause(eqopr, BOOLOID, false,
(Expr *) leftop, (Expr *) prefix,
InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
return result;
}
+ /*
+ * Anything other than Pattern_Prefix_Exact is not supported if the
+ * expression collation is nondeterministic. The optimized equality or
+ * prefix tests use bytewise comparisons, which is not consistent with
+ * nondeterministic collations.
+ *
+ * expr_coll is not set for a non-collation-aware data type such as bytea.
+ */
+ if (expr_coll && !get_collation_isdeterministic(expr_coll))
+ return NIL;
+
/*
* Otherwise, we have a nonempty required prefix of the values. Some
* opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7d59fb44316..8bfdf33c2fa 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
NOTICE: using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
2 | äbc
(2 rows)
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a | b
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a | b
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
-- same with arrays
CREATE TABLE test6a (a int, b text[]);
INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
(2 rows)
SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR: nondeterministic collations are not supported for LIKE
+ x
+-----
+ abc
+ ABC
+(2 rows)
+
SELECT x FROM test3ci WHERE x ILIKE 'a%';
ERROR: nondeterministic collations are not supported for ILIKE
SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
(2 rows)
SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR: nondeterministic collations are not supported for LIKE
+ x
+-----
+ abc
+ ABC
+(2 rows)
+
SELECT x FROM test3bpci WHERE x ILIKE 'a%';
ERROR: nondeterministic collations are not supported for ILIKE
SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
-- This tests the issue described in match_pattern_prefix(). In the
-- absence of that check, the case_insensitive tests below would
-- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
INSERT INTO test4c VALUES ('abc');
CREATE INDEX ON test4c (x);
SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive; -- ok, no rows
---
(0 rows)
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- error
-ERROR: nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- error
-ERROR: nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- ok
+ x
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- ok
+ x
+-----
+ abc
+(1 row)
+
RESET enable_seqscan;
-- Unicode special case: different variants of Greek lower case sigma.
-- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,18 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
1 | cote
(1 row)
+-- This is a tricky one. A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore. Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column?
+----------
+ t
+(1 row)
+
-- foreign keys (should use collation of primary key)
-- PK is case-sensitive, FK is case-insensitive
CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..481a995c998 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
-- same with arrays
CREATE TABLE test6a (a int, b text[]);
INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x); -- error
-- This tests the issue described in match_pattern_prefix(). In the
-- absence of that check, the case_insensitive tests below would
-- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
INSERT INTO test4c VALUES ('abc');
CREATE INDEX ON test4c (x);
SET enable_seqscan = off;
SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive; -- ok, no rows
SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive; -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive; -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive; -- ok
RESET enable_seqscan;
-- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,14 @@ CREATE TABLE test4 (a int, b text);
SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents; -- still case-sensitive
SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
+-- This is a tricky one. A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore. Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+
-- foreign keys (should use collation of primary key)
-- PK is case-sensitive, FK is case-insensitive
base-commit: 3e53492aa7084bceaa92757c90e067d79768797e
--
2.45.2
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-06-28 06:31 ` Re: Support LIKE with nondeterministic collations Peter Eisentraut <[email protected]>
@ 2026-05-18 10:23 ` Jelte Fennema-Nio <[email protected]>
0 siblings, 0 replies; 13+ messages in thread
From: Jelte Fennema-Nio @ 2026-05-18 10:23 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: jian he <[email protected]>; Heikki Linnakangas <[email protected]>; Jacob Champion <[email protected]>; pgsql-hackers; Daniel Verite <[email protected]>; Paul A Jungwirth <[email protected]>
On Wed, 27 Nov 2024 at 09:01, Peter Eisentraut <[email protected]> wrote:
> I have committed it, thanks.
This commit introduced a regression[1], which makes psql \d and our
tests much slower than they need to be, because indexes are not used
anymore when they should be. Pinging here, since that thread got no
response so far.
[1]: https://www.postgresql.org/message-id/[email protected]
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Support LIKE with nondeterministic collations
@ 2026-05-18 11:43 Nico Williams <[email protected]>
0 siblings, 0 replies; 13+ messages in thread
From: Nico Williams @ 2026-05-18 11:43 UTC (permalink / raw)
To: Peter Eisentraut <[email protected]>; +Cc: Daniel Verite <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers
On Fri, May 03, 2024 at 08:58:30PM +0200, Peter Eisentraut wrote:
> On 03.05.24 17:47, Daniel Verite wrote:
> > Peter Eisentraut wrote:
> >
> > > However, off the top of my head, this definition has three flaws: (1)
> > > It would make the single-character wildcard effectively an
> > > any-number-of-characters wildcard, but only in some circumstances, which
> > > could be confusing, (2) it would be difficult to compute, because you'd
> > > have to check equality against all possible single-character strings,
> > > and (3) it is not what the SQL standard says.
> >
> > For #1 we're currently using the definition of a "character" as
> > being any single point of code,
>
> That is the definition that is used throughout SQL and PostgreSQL. We can't
> change that without redefining everything. To pick just one example, the
> various trim function also behave in seemingly inconsistent ways when you
> apply then to strings in different normalization forms. The better fix there
> is to enforce the normalization form somehow.
That's unfortunate, because in Unicode a character (see below) need not
be represented as a single code point, or even be possible to represent
using a single code point -- multiple codepoints can be required even in
order to represent a character. This means that substring operations
need to exist that respect character boundaries rather than code point
boundaries.
Changing functions like `overlay()`, `position()` and `substring()` to
count characters rather than codepoints would be a potentially breaking
change. Or maybe a 'fixing change', depending on how one looks at it!
But changing LIKE to respect character boundaries, especially in cases
where LIKE would not have worked to begin with before this change, seems
safe to me -- certainly it seems desirable. The example Daniel posted
definitely shows how interpreting 'character' to mean 'code point'
yields broken behaviors as seen by _humans_[0].
[0] And also by LLMs, though since those are [mainly / for now] trained
on human-generated corpuses, these are human-like too!
> > Intuitively I think that our interpretation of "character" here should
> > be whatever sequence of code points are between character
> > boundaries [1], and that the equality of such characters would be the
> > equality of their sequences of code points, with the string equality
> > check of the collation, whatever the length of these sequences.
>
> > [1]:
> > https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary
>
> Even that page says, what we are calling character here is really called a
> grapheme cluster.
Not quite.
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
| A single Unicode code point is often, but not always the same as a
| basic unit of a writing system for a language, or what a typical user
| might think of as a “character”. There are many cases where such a basic
| unit is made up of multiple Unicode code points. To avoid ambiguity with
| the term character as defined for encoding purposes, it can be useful to
| speak of a user-perceived character. For example, “G” + grave-accent is
| a user-perceived character: users think of it as a single character, yet
| is actually represented by two Unicode code points.
Here the spec refers to "user-perceived character", and leaves
'character' rather ill-defined, but then:
https://www.unicode.org/glossary/
The Unicode glossary says:
| Code Point. (1) Any value in the Unicode codespace; that is, the range
| of integers from 0 to 10FFFF16. (See definition D10 in Section 3.4,
| Characters and Encoding.) Not all code points are assigned to encoded
| characters. See code point type. (2) A value, or position, for a
| character, in any coded character set.
| Character. (1) The smallest component of written language that has
| semantic value; refers to the abstract meaning and/or shape, rather than
| a specific shape (see also glyph), though in code tables some form of
| visual representation is essential for the reader’s understanding. (2)
| Synonym for abstract character. (3) The basic unit of encoding for the
| Unicode character encoding. (4) The English name for the ideographic
| written elements of Chinese origin. [See ideograph (2).]
Note that a combining mark code point is _not_ a character by this
definition (1), though (2):
| Abstract Character. A unit of information used for the organization,
| control, or representation of textual data. (See definition D7 in
| Section 3.4, Characters and Encoding.)
So control characters are also characters. Makes sense: newline and
carriage return are meaningful to users of Latin scripts.
For characters other than 'control' ones this does make it clear that a
character is the coded representation of what we see as a 'glyph'.
| Glyph. (1) An abstract form that represents one or more glyph images.
| (2) A synonym for glyph image. In displaying Unicode character data, one
| or more glyphs may be selected to depict a particular character. These
| glyphs are selected by a rendering engine during composition and layout
| processing. (See also character.)
|
| Grapheme. (1) A minimally distinctive unit of writing in the context of
| a particular writing system. For example, ‹b› and ‹d› are distinct
| graphemes in English writing systems because there exist distinct words
| like big and dig. Conversely, a lowercase italiform letter a and a
| lowercase Roman letter a are not distinct graphemes because no word is
| distinguished on the basis of these two different forms. (2) What a user
| thinks of as a character.
|
| Grapheme Cluster. The text between grapheme cluster boundaries as
| specified by Unicode Standard Annex #29, "Unicode Text Segmentation."
| (See definition D60 in Section 3.6, Combination.) A grapheme cluster
| represents a horizontally segmentable unit of text, consisting of some
| grapheme base (which may consist of a Korean syllable) together with any
| number of nonspacing marks applied to it.
> In a different world, pattern matching, character trimming, etc. would work
> by grapheme, but it does not.
In this very particular case it should though, and it wouldn't be a
breaking change because:
- LIKE with non-deterministic collations is not yet supported, so you
wouldn't be changing any existing behaviors in a breaking manner
- there is no PG function for partitioning a string into substrings
using LIKE patterns (regular expressions, yes) (I think!)
One could argue that counting graphemes rather than code points can only
_fix_ bugs rather than break code, and I want to make that argument, but
it's a harder argument to make, so I won't make it today.
Nico
--
^ permalink raw reply [nested|flat] 13+ messages in thread
end of thread, other threads:[~2026-05-18 11:51 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-04-30 12:39 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-02 13:38 ` Peter Eisentraut <[email protected]>
2024-05-03 00:11 ` Robert Haas <[email protected]>
2024-05-03 08:52 ` Peter Eisentraut <[email protected]>
2024-05-03 13:20 ` Robert Haas <[email protected]>
2024-05-03 14:12 ` Peter Eisentraut <[email protected]>
2024-05-03 14:58 ` Daniel Verite <[email protected]>
2024-05-03 18:53 ` Peter Eisentraut <[email protected]>
2024-07-31 22:26 ` Jeff Davis <[email protected]>
2026-05-18 11:51 ` Nico Williams <[email protected]>
2024-06-28 06:31 ` Peter Eisentraut <[email protected]>
2026-05-18 10:23 ` Jelte Fennema-Nio <[email protected]>
2026-05-18 11:43 Re: Support LIKE with nondeterministic collations Nico Williams <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox