public inbox for [email protected]
help / color / mirror / Atom feedSorting by respecting diacritics/accents
2+ messages / 2 participants
[nested] [flat]
* Sorting by respecting diacritics/accents
@ 2025-07-25 10:05 JānisE <[email protected]>
0 siblings, 1 reply; 2+ messages in thread
From: JānisE @ 2025-07-25 10:05 UTC (permalink / raw)
To: [email protected] <[email protected]>
Hello! I seem to not be able to get PostgreSQL to sort rows by a string column respecting the diacritics. I read [1] that it's possible to define a custom collation having collation strength "ks" set to "level2", which would mean that it's accent-sensitive. However, when I try to actually sort using that collation, the order seem to be accent-insensitive. For example: CREATE TABLE test (string text); INSERT INTO test VALUES ('bar'), ('bat'), ('bär'); CREATE COLLATION "und1" (provider = icu, deterministic = false, locale = 'und-u-ks-level1'); CREATE COLLATION "und2" (provider = icu, deterministic = false, locale = 'und-u-ks-level2'); CREATE COLLATION "und3" (provider = icu, deterministic = false, locale = 'und-u-ks-level3'); SELECT * FROM test ORDER BY string collate "und1"; SELECT * FROM test ORDER BY string collate "und2"; SELECT * FROM test ORDER BY string collate "und3"; All three collations give me the same order: bar < bär < bat, although an accent-sensitive order would be bar < bat < bär This does lose "bär", meaning that those strength levels do have some kind of an effect on "DISTINCT": SELECT DISTINCT string COLLATE "und1" FROM test; But it's not working on "ORDER BY". Do I misunderstand the collation capabilities? Is there a way to actually get an accent-sensitive order? Also, is there a way to see what options are there for the default built-in collations? I don't see, for example, the used "ks" level in the "pg_collation" table data. Best regards, Janis [1] https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-COMPARISON-LEVELS
^ permalink raw reply [nested|flat] 2+ messages in thread
* Re: Sorting by respecting diacritics/accents
@ 2025-07-25 14:23 Laurenz Albe <[email protected]>
parent: JānisE <[email protected]>
0 siblings, 0 replies; 2+ messages in thread
From: Laurenz Albe @ 2025-07-25 14:23 UTC (permalink / raw)
To: JānisE <[email protected]>; [email protected] <[email protected]>
On Fri, 2025-07-25 at 13:05 +0300, JānisE wrote:
> I seem to not be able to get PostgreSQL to sort rows by a string column respecting the diacritics.
>
> I read [1] that it's possible to define a custom collation having collation strength "ks"
> set to "level2", which would mean that it's accent-sensitive.
>
> However, when I try to actually sort using that collation, the order seem to be accent-insensitive.
>
> For example:
>
> CREATE TABLE test (string text);
> INSERT INTO test VALUES ('bar'), ('bat'), ('bär');
> CREATE COLLATION "und1" (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
> CREATE COLLATION "und2" (provider = icu, deterministic = false, locale = 'und-u-ks-level2');
> CREATE COLLATION "und3" (provider = icu, deterministic = false, locale = 'und-u-ks-level3');
> SELECT * FROM test ORDER BY string collate "und1";
> SELECT * FROM test ORDER BY string collate "und2";
> SELECT * FROM test ORDER BY string collate "und3";
>
> All three collations give me the same order: bar < bär < bat, although an accent-sensitive
> order would be bar < bat < bär
>
> This does lose "bär", meaning that those strength levels do have some kind of an effect on "DISTINCT":
> SELECT DISTINCT string COLLATE "und1" FROM test;
>
> But it's not working on "ORDER BY".
>
> Do I misunderstand the collation capabilities? Is there a way to actually get an accent-sensitive order?
Yes, I thing you misunderstand what "accent sensitive" means.
It means that 'bar' <> 'bär'.
Natural language collations compare strings on different levels:
- 'bar' and 'bär' are identical on the first level (base character)
- 'bar' and 'bär' are different on the second level (accent)
- there are two more levels, the third being case
Strings are ordered by the first level first, then by the second, and so on.
I recommend reading Peter's excellent blog:
http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings
So you end up with 'bar' < 'bär' < 'bat', because the first two compare
equal on level 1.
What you are looking for is a collation where accents are a first-level
difference. The only way to do that with ICU collations, as far as I know,
is to add explicit rules, like in this example:
https://stackoverflow.com/a/77288282/6464308
> Also, is there a way to see what options are there for the default built-in collations?
> I don't see, for example, the used "ks" level in the "pg_collation" table data.
You can see that in the "colllocale" column. The name of the ICU locale
determines its capabilities.
Yours,
Laurenz Albe
^ permalink raw reply [nested|flat] 2+ messages in thread
end of thread, other threads:[~2025-07-25 14:23 UTC | newest]
Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-07-25 10:05 Sorting by respecting diacritics/accents JānisE <[email protected]>
2025-07-25 14:23 ` Laurenz Albe <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox