Re: Sorting by respecting diacritics/accents

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Laurenz Albe <[email protected]>
To: JānisE <[email protected]>
To: [email protected] <[email protected]>
Subject: Re: Sorting by respecting diacritics/accents
Date: Fri, 25 Jul 2025 16:23:48 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <1814742357.238678.1753437917809@w8>
References: <1814742357.238678.1753437917809@w8>

On Fri, 2025-07-25 at 13:05 +0300, JānisE wrote:
> I seem to not be able to get PostgreSQL to sort rows by a string column respecting the diacritics. 
> 
> I read [1] that it's possible to define a custom collation having collation strength "ks"
> set to "level2", which would mean that it's accent-sensitive.
> 
> However, when I try to actually sort using that collation, the order seem to be accent-insensitive.
> 
> For example:
> 
>  CREATE TABLE test (string text); 
>  INSERT INTO test VALUES ('bar'), ('bat'), ('bär'); 
>  CREATE COLLATION "und1" (provider = icu, deterministic = false, locale = 'und-u-ks-level1'); 
>  CREATE COLLATION "und2" (provider = icu, deterministic = false, locale = 'und-u-ks-level2'); 
>  CREATE COLLATION "und3" (provider = icu, deterministic = false, locale = 'und-u-ks-level3'); 
>  SELECT * FROM test ORDER BY string collate "und1"; 
>  SELECT * FROM test ORDER BY string collate "und2"; 
>  SELECT * FROM test ORDER BY string collate "und3";
> 
> All three collations give me the same order: bar < bär < bat, although an accent-sensitive
> order would be bar < bat < bär
> 
> This does lose "bär", meaning that those strength levels do have some kind of an effect on "DISTINCT": 
> SELECT DISTINCT string COLLATE "und1" FROM test;
> 
> But it's not working on "ORDER BY".
> 
> Do I misunderstand the collation capabilities? Is there a way to actually get an accent-sensitive order?

Yes, I thing you misunderstand what "accent sensitive" means.
It means that 'bar' <> 'bär'.

Natural language collations compare strings on different levels:
- 'bar' and 'bär' are identical on the first level (base character)
- 'bar' and 'bär' are different on the second level (accent)
- there are two more levels, the third being case

Strings are ordered by the first level first, then by the second, and so on.

I recommend reading Peter's excellent blog:
http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings

So you end up with 'bar' < 'bär' < 'bat', because the first two compare
equal on level 1.

What you are looking for is a collation where accents are a first-level
difference.  The only way to do that with ICU collations, as far as I know,
is to add explicit rules, like in this example:
https://stackoverflow.com/a/77288282/6464308

> Also, is there a way to see what options are there for the default built-in collations?
> I don't see, for example, the used "ks" level in the "pg_collation" table data. 

You can see that in the "colllocale" column.  The name of the ICU locale
determines its capabilities.

Yours,
Laurenz Albe

view thread (2+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Sorting by respecting diacritics/accents
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox