Re: Support LIKE with nondeterministic collations

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Support LIKE with nondeterministic collations
5+ messages / 4 participants
[nested] [flat]

* Re: Support LIKE with nondeterministic collations
@ 2024-05-03 15:47  Daniel Verite <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Daniel Verite @ 2024-05-03 15:47 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: Robert Haas <[email protected]>; pgsql-hackers

	Peter Eisentraut wrote:

>  However, off the top of my head, this definition has three flaws: (1) 
> It would make the single-character wildcard effectively an 
> any-number-of-characters wildcard, but only in some circumstances, which 
> could be confusing, (2) it would be difficult to compute, because you'd 
> have to check equality against all possible single-character strings, 
> and (3) it is not what the SQL standard says.

For #1 we're currently using the definition of a "character" as 
being any single point of code, but this definition fits poorly
with non-deterministic collation rules.

The simplest illustration I can think of is the canonical
equivalence match between the NFD and NFC forms of an
accented character.

postgres=# CREATE COLLATION nd (
  provider = 'icu',
  locale = 'und',
  deterministic = false
);		       

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
 ?column? 
----------
 t

postgres=# select U&'e\0301' like '_' collate nd;
 ?column? 
----------
 f
(1 row)

I understand why the algorithm produces these opposite results.
But at the semantic level, when asked if the left-hand string matches
a specific character, it says yes, and when asked if it matches any
character, it says no.
To me it goes beyond counter-intuitive, it's not reasonable enough to
be called correct.

What could we do about it?
Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points are between character
boundaries [1], and that the equality of such characters would be the
equality of their sequences of code points, with the string equality
check of the collation, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Support LIKE with nondeterministic collations
@ 2024-05-03 18:58  Peter Eisentraut <[email protected]>
  parent: Daniel Verite <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Peter Eisentraut @ 2024-05-03 18:58 UTC (permalink / raw)
  To: Daniel Verite <[email protected]>; +Cc: Robert Haas <[email protected]>; pgsql-hackers

On 03.05.24 17:47, Daniel Verite wrote:
> 	Peter Eisentraut wrote:
> 
>>   However, off the top of my head, this definition has three flaws: (1)
>> It would make the single-character wildcard effectively an
>> any-number-of-characters wildcard, but only in some circumstances, which
>> could be confusing, (2) it would be difficult to compute, because you'd
>> have to check equality against all possible single-character strings,
>> and (3) it is not what the SQL standard says.
> 
> For #1 we're currently using the definition of a "character" as
> being any single point of code,

That is the definition that is used throughout SQL and PostgreSQL.  We 
can't change that without redefining everything.  To pick just one 
example, the various trim function also behave in seemingly inconsistent 
ways when you apply then to strings in different normalization forms. 
The better fix there is to enforce the normalization form somehow.

> Intuitively I think that our interpretation of "character" here should
> be whatever sequence of code points are between character
> boundaries [1], and that the equality of such characters would be the
> equality of their sequences of code points, with the string equality
> check of the collation, whatever the length of these sequences.
> 
> [1]:
> https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Even that page says, what we are calling character here is really called 
a grapheme cluster.

In a different world, pattern matching, character trimming, etc. would 
work by grapheme, but it does not.







^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Support LIKE with nondeterministic collations
@ 2026-05-18 10:23  Jelte Fennema-Nio <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Jelte Fennema-Nio @ 2026-05-18 10:23 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: jian he <[email protected]>; Heikki Linnakangas <[email protected]>; Jacob Champion <[email protected]>; pgsql-hackers; Daniel Verite <[email protected]>; Paul A Jungwirth <[email protected]>

On Wed, 27 Nov 2024 at 09:01, Peter Eisentraut <[email protected]> wrote:
> I have committed it, thanks.

This commit introduced a regression[1], which makes psql \d and our
tests much slower than they need to be, because indexes are not used
anymore when they should be. Pinging here, since that thread got no
response so far.

[1]:  https://www.postgresql.org/message-id/[email protected]






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Support LIKE with nondeterministic collations
@ 2026-05-18 11:43  Nico Williams <[email protected]>
  parent: Peter Eisentraut <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Nico Williams @ 2026-05-18 11:43 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: Daniel Verite <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers

On Fri, May 03, 2024 at 08:58:30PM +0200, Peter Eisentraut wrote:
> On 03.05.24 17:47, Daniel Verite wrote:
> > 	Peter Eisentraut wrote:
> > 
> > >   However, off the top of my head, this definition has three flaws: (1)
> > > It would make the single-character wildcard effectively an
> > > any-number-of-characters wildcard, but only in some circumstances, which
> > > could be confusing, (2) it would be difficult to compute, because you'd
> > > have to check equality against all possible single-character strings,
> > > and (3) it is not what the SQL standard says.
> > 
> > For #1 we're currently using the definition of a "character" as
> > being any single point of code,
> 
> That is the definition that is used throughout SQL and PostgreSQL.  We can't
> change that without redefining everything.  To pick just one example, the
> various trim function also behave in seemingly inconsistent ways when you
> apply then to strings in different normalization forms. The better fix there
> is to enforce the normalization form somehow.

That's unfortunate, because in Unicode a character (see below) need not
be represented as a single code point, or even be possible to represent
using a single code point -- multiple codepoints can be required even in
order to represent a character.  This means that substring operations
need to exist that respect character boundaries rather than code point
boundaries.

Changing functions like `overlay()`, `position()` and `substring()` to
count characters rather than codepoints would be a potentially breaking
change.  Or maybe a 'fixing change', depending on how one looks at it!

But changing LIKE to respect character boundaries, especially in cases
where LIKE would not have worked to begin with before this change, seems
safe to me -- certainly it seems desirable.  The example Daniel posted
definitely shows how interpreting 'character' to mean 'code point'
yields broken behaviors as seen by _humans_[0].

[0] And also by LLMs, though since those are [mainly / for now] trained
    on human-generated corpuses, these are human-like too!

> > Intuitively I think that our interpretation of "character" here should
> > be whatever sequence of code points are between character
> > boundaries [1], and that the equality of such characters would be the
> > equality of their sequences of code points, with the string equality
> > check of the collation, whatever the length of these sequences.
> 
> > [1]:
> > https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary
> 
> Even that page says, what we are calling character here is really called a
> grapheme cluster.

Not quite.

https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

| A single Unicode code point is often, but not always the same as a
| basic unit of a writing system for a language, or what a typical user
| might think of as a “character”. There are many cases where such a basic
| unit is made up of multiple Unicode code points. To avoid ambiguity with
| the term character as defined for encoding purposes, it can be useful to
| speak of a user-perceived character. For example, “G” + grave-accent is
| a user-perceived character: users think of it as a single character, yet
| is actually represented by two Unicode code points.

Here the spec refers to "user-perceived character", and leaves
'character' rather ill-defined, but then:

https://www.unicode.org/glossary/

The Unicode glossary says:

| Code Point. (1) Any value in the Unicode codespace; that is, the range
| of integers from 0 to 10FFFF16. (See definition D10 in Section 3.4,
| Characters and Encoding.) Not all code points are assigned to encoded
| characters. See code point type. (2) A value, or position, for a
| character, in any coded character set.

| Character. (1) The smallest component of written language that has
| semantic value; refers to the abstract meaning and/or shape, rather than
| a specific shape (see also glyph), though in code tables some form of
| visual representation is essential for the reader’s understanding. (2)
| Synonym for abstract character. (3) The basic unit of encoding for the
| Unicode character encoding. (4) The English name for the ideographic
| written elements of Chinese origin. [See ideograph (2).]

Note that a combining mark code point is _not_ a character by this
definition (1), though (2):

| Abstract Character. A unit of information used for the organization,
| control, or representation of textual data. (See definition D7 in
| Section 3.4, Characters and Encoding.)

So control characters are also characters.  Makes sense: newline and
carriage return are meaningful to users of Latin scripts.

For characters other than 'control' ones this does make it clear that a
character is the coded representation of what we see as a 'glyph'.

| Glyph. (1) An abstract form that represents one or more glyph images.
| (2) A synonym for glyph image. In displaying Unicode character data, one
| or more glyphs may be selected to depict a particular character. These
| glyphs are selected by a rendering engine during composition and layout
| processing. (See also character.)
| 
| Grapheme. (1) A minimally distinctive unit of writing in the context of
| a particular writing system. For example, ‹b› and ‹d› are distinct
| graphemes in English writing systems because there exist distinct words
| like big and dig. Conversely, a lowercase italiform letter a and a
| lowercase Roman letter a are not distinct graphemes because no word is
| distinguished on the basis of these two different forms. (2) What a user
| thinks of as a character.
| 
| Grapheme Cluster. The text between grapheme cluster boundaries as
| specified by Unicode Standard Annex #29, "Unicode Text Segmentation."
| (See definition D60 in Section 3.6, Combination.) A grapheme cluster
| represents a horizontally segmentable unit of text, consisting of some
| grapheme base (which may consist of a Korean syllable) together with any
| number of nonspacing marks applied to it.

> In a different world, pattern matching, character trimming, etc. would work
> by grapheme, but it does not.

In this very particular case it should though, and it wouldn't be a
breaking change because:

 - LIKE with non-deterministic collations is not yet supported, so you
   wouldn't be changing any existing behaviors in a breaking manner

 - there is no PG function for partitioning a string into substrings
   using LIKE patterns (regular expressions, yes)  (I think!)

One could argue that counting graphemes rather than code points can only
_fix_ bugs rather than break code, and I want to make that argument, but
it's a harder argument to make, so I won't make it today.

Nico
-- 

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Support LIKE with nondeterministic collations
@ 2026-05-18 11:51  Nico Williams <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Nico Williams @ 2026-05-18 11:51 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: Daniel Verite <[email protected]>; Peter Eisentraut <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers

On Wed, Jul 31, 2024 at 03:26:34PM -0700, Jeff Davis wrote:
> On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:
> > In other words it says that
> > 
> >   col LIKE 'smith%' collate "nd"
> > 
> > is equivalent to:
> > 
> >   col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
> 
> That logic seems to assume something about the collation. If you have a
> collation that orders strings by their sha256 hash, that would entirely
> break the connection between prefixes and ranges, and it wouldn't work.

The hash of what? each character's names or canonical representations in
some UTF?  If so, then, to maintain the above equivalence one would have
to alter the definition of this 'hash-based collation' so that U+FFFF is
always "last".

> Is there something about the way collations are defined that inherently
> maintains a connection between a prefix and a range? [...]

Yes: rules like the one Daniel described.

>                                               [...]? Does it remain
> true even when strange rules are added to a collation?

There are 'strange rules' which cannot be used in defining a collation,
as the result would not then be a collation.

Nico
-- 






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2026-05-18 11:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-05-03 15:47 Re: Support LIKE with nondeterministic collations Daniel Verite <[email protected]>
2024-05-03 18:58 ` Peter Eisentraut <[email protected]>
2026-05-18 11:43   ` Nico Williams <[email protected]>
2026-05-18 10:23 Re: Support LIKE with nondeterministic collations Jelte Fennema-Nio <[email protected]>
2026-05-18 11:51 Re: Support LIKE with nondeterministic collations Nico Williams <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox