Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ufJL1-005LCt-8e for pgsql-general@arkaria.postgresql.org; Fri, 25 Jul 2025 14:23:56 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ufJL0-001Lyz-A5 for pgsql-general@arkaria.postgresql.org; Fri, 25 Jul 2025 14:23:54 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ufJKz-001Lyq-TS for pgsql-general@lists.postgresql.org; Fri, 25 Jul 2025 14:23:54 +0000 Received: from mail-wr1-x433.google.com ([2a00:1450:4864:20::433]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1ufJKy-000jPt-0G for pgsql-general@lists.postgresql.org; Fri, 25 Jul 2025 14:23:52 +0000 Received: by mail-wr1-x433.google.com with SMTP id ffacd0b85a97d-3a531fcaa05so1058337f8f.3 for ; Fri, 25 Jul 2025 07:23:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cybertec.at; s=google; t=1753453430; x=1754058230; darn=lists.postgresql.org; h=mime-version:user-agent:content-transfer-encoding:references :in-reply-to:date:to:from:subject:message-id:from:to:cc:subject:date :message-id:reply-to; bh=hB7URsNxZ64k3YzQdb+YMHV+Kju64AIAucDvH5nad0o=; b=fB1ikN1dIYDYkYmvUuF4R30urA7pl/DXsAZ89Mj9JOYhcsB3XtWfUI+KMw3fwpIMCa dN7um0ILHJWZGXTecU9dOoJQqe+SAxUfAKW3BURjQWrYQ1gKd8cFz06+DQjUr+vhhUpH hV6bzYMsTzOKtInUPjIhqH+WeuWrEGVVxq5Ytb6kZJiPrUjMvMVme+qJpyBp+taLNWkz DvQoitTeN5p8r6DoWvQrJw3qkBhS+Fj+J7Ds6DXpfm8BGMu46WWc/hXWEkZPML2vcrDD HmoV6YUoGnbichH3qmMzv/6ERGHwva9yIeTW0IG/OoBD0MwqBBHR+fEdLcmdIWivx9BU dzbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1753453430; x=1754058230; h=mime-version:user-agent:content-transfer-encoding:references :in-reply-to:date:to:from:subject:message-id:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=hB7URsNxZ64k3YzQdb+YMHV+Kju64AIAucDvH5nad0o=; b=QbTvGPyc3l5IKyxsxRYq3As/DfPKD5TzoDrTWMyB9Nwx5Vpaw6RwaI7izRvy1Zw4tx hgqstFytMH+q1F6DgKmg2+zRF0wEMD69/BKy7HGylh0zEV4QKJRMvVjtKxuDtXgGP42s c6JH7Aw4jTAlDDBaxHZO7ECdxCiPtTQVCZX58+REG1Sfove+EEic2xCjpEok1sYOkhaC 5lmCe+Jz9mB54tWgoWVoE3PQ6sqJ2zw/CZZyH0e7L3AX4LgbcxbIbkQv3DHptfsD9QfL odnhbNFKNtwnS6i80izXRFphv9OoPTjcRq89enGYcAI1kSiC/zZCQtHiHYMApflHqTRO d+1w== X-Forwarded-Encrypted: i=1; AJvYcCUBwa/lpTLUzMl6aKxHNSrPBUhn/cTSbENPZ72xz+XRTtn7jgHQM3oFbPDpa/7md9z0/sDa5DeqwTrbqtlI@lists.postgresql.org X-Gm-Message-State: AOJu0YxJPHJfIyc+4kFz2Nw7vLXPsGzZRQ2aNw7zrNjK3Hw+AYcLGk+x afXXv04fB2ts/obR6jNfB4rH++XkO4dh+MKB+iKjHnClKG3ovA4VDIWMQknh2aIXmW303fcErZ1 F3n/r X-Gm-Gg: ASbGnctJBp+EIC843k6zFrKPJU1K+cJWnPAaBV9UdXUraxVQfZqgN3yMAuot2Levs4g P4O8pqHvEf5gkGmy7lVRJsnDHnC35Qke3u7gh3elJ5AR6cBp2Qbc92SeSmBVxLcPMXxmaEHZiLP //MZf+0Bls9/nD3RXrzDZ1XJqz4MjOkvZfBbZyAH/ZPT7l+H3WFhZtISikKcjF6u4hKfCMiIOFr JAkDxpUPlmowUpoU+jxC0bjLHcXbtnWI0OETdIUjJw3lnXI7MKH/ImAT5ZxcuFG4BXqLQgG6qFl 20FlD3sUTLKYnfq3aPOHCP769O69UrSDV80wK0QA+CCbruh3YD+uMRQNnYf2zYHo5HaD4N+zs48 Wsy0sb6bcJWuyVZZl+BAvmh1NYiaZcuZjUbMNdP7kBK7egChyKR4= X-Google-Smtp-Source: AGHT+IE9KXItFQPcOjzjSmLO/Z7RbWuq3co2IFPHWxWvzAR53DdntMvsswgrWvxnDYVHAPztUV/ooA== X-Received: by 2002:a05:6000:2287:b0:3a3:6a9a:5ebf with SMTP id ffacd0b85a97d-3b7765f2b3bmr2043477f8f.20.1753453430029; Fri, 25 Jul 2025 07:23:50 -0700 (PDT) Received: from laurenz.albe-K4N0CV00F97414D ([2001:871:5e:b847:4859:2f5d:fab8:9595]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3b778f0c5dfsm34600f8f.63.2025.07.25.07.23.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Jul 2025 07:23:49 -0700 (PDT) Message-ID: Subject: Re: Sorting by respecting diacritics/accents From: Laurenz Albe To: =?UTF-8?Q?J=C4=81nisE?= , "pgsql-general@lists.postgresql.org" Date: Fri, 25 Jul 2025 16:23:48 +0200 In-Reply-To: <1814742357.238678.1753437917809@w8> References: <1814742357.238678.1753437917809@w8> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.56.2 (3.56.2-1.fc42) MIME-Version: 1.0 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, 2025-07-25 at 13:05 +0300, J=C4=81nisE wrote: > I seem to not be able to get PostgreSQL to sort rows by a string column r= especting the diacritics.=20 >=20 > I read [1] that it's possible to define a custom collation having collati= on strength "ks" > set to "level2", which would mean that it's accent-sensitive. >=20 > However, when I try to actually sort using that collation, the order seem= to be accent-insensitive. >=20 > For example: >=20 > =C2=A0CREATE TABLE test (string text);=20 > =C2=A0INSERT INTO test VALUES ('bar'), ('bat'), ('b=C3=A4r');=20 > =C2=A0CREATE COLLATION "und1" (provider =3D icu, deterministic =3D false,= locale =3D 'und-u-ks-level1');=20 > =C2=A0CREATE COLLATION "und2" (provider =3D icu, deterministic =3D false,= locale =3D 'und-u-ks-level2');=20 > =C2=A0CREATE COLLATION "und3" (provider =3D icu, deterministic =3D false,= locale =3D 'und-u-ks-level3');=20 > =C2=A0SELECT * FROM test ORDER BY string collate "und1";=20 > =C2=A0SELECT * FROM test ORDER BY string collate "und2";=20 > =C2=A0SELECT * FROM test ORDER BY string collate "und3"; >=20 > All three collations give me the same order: bar < b=C3=A4r < bat, althou= gh an accent-sensitive > order would be bar < bat < b=C3=A4r >=20 > This does lose "b=C3=A4r", meaning that those strength levels do have som= e kind of an effect on "DISTINCT":=20 > SELECT DISTINCT string COLLATE "und1" FROM test; >=20 > But it's not working on "ORDER BY". >=20 > Do I misunderstand the collation capabilities? Is there a way to actually= get an accent-sensitive order? Yes, I thing you misunderstand what "accent sensitive" means. It means that 'bar' <> 'b=C3=A4r'. Natural language collations compare strings on different levels: - 'bar' and 'b=C3=A4r' are identical on the first level (base character) - 'bar' and 'b=C3=A4r' are different on the second level (accent) - there are two more levels, the third being case Strings are ordered by the first level first, then by the second, and so on= . I recommend reading Peter's excellent blog: http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-setti= ngs So you end up with 'bar' < 'b=C3=A4r' < 'bat', because the first two compar= e equal on level 1. What you are looking for is a collation where accents are a first-level difference. The only way to do that with ICU collations, as far as I know, is to add explicit rules, like in this example: https://stackoverflow.com/a/77288282/6464308 > Also, is there a way to see what options are there for the default built-= in collations? > I don't see, for example, the used "ks" level in the "pg_collation" table= data.=20 You can see that in the "colllocale" column. The name of the ICU locale determines its capabilities. Yours, Laurenz Albe