public inbox for [email protected]  
help / color / mirror / Atom feed
From: Masahiko Sawada <[email protected]>
To: Tom Lane <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Aleksander Alekseev <[email protected]>
Cc: pgsql-hackers <[email protected]>
Cc: Chao Li <[email protected]>
Cc: Dagfinn Ilmari Mannsåker <[email protected]>
Cc: Andrey Borodin <[email protected]>
Subject: Re: Add uuid_to_base32hex() and base32hex_to_uuid() built-in functions
Date: Wed, 25 Mar 2026 18:21:38 -0700
Message-ID: <CAD21AoC=92zAEmaGt+3U1jSD77jgdQT_HXUwPPtZEze2sJ=vWA@mail.gmail.com> (raw)
In-Reply-To: <CAD21AoAwX1D6baSGuQXm0mzPXPWB07kgaoaaahjNHHenbdY24A@mail.gmail.com>
References: <[email protected]>
	<[email protected]>
	<CAJ7c6TNCGT2eZ8JttAZKUhVv3jJFTHvjFw6typzSGkM5_vePow@mail.gmail.com>
	<CAD21AoCcikUgU5coyvM7PLJg5M-75LzQhm-Uxmqn_7M=DHkj+w@mail.gmail.com>
	<CAJ7c6TPu7NGrSj-V2-ns0GTNGx9cmoKQgzt-p2MpJepiGjQnYw@mail.gmail.com>
	<CAD21AoC2Ooy=W=4YjQ_GO5nEYZ2tsn43=TkLf7ysE-96ygohKg@mail.gmail.com>
	<[email protected]>
	<CAD21AoA7UMWG8YO2Edf1UWSdtaof6gOnF98bJGoz1z1JBzozuQ@mail.gmail.com>
	<CAJ7c6TMd4ebu7oha2C9QZ_aKNE5-keEJ=bE_k_-vrMuw2p1LNg@mail.gmail.com>
	<CAJ7c6TNzQES3EhzoftNUEg=mfi69SYEQ=Fh-=rr84b7zHECHeg@mail.gmail.com>
	<CAD21AoAj+KOc74wOr04oGo9ySYmf_TYkMi5fByLF80cqyZmB=g@mail.gmail.com>
	<[email protected]>
	<CAJ7c6TN7fjWDWSVmhQuNCPYLVL9cs9ifc4D_Hj6wUQd2Lw2oNQ@mail.gmail.com>
	<CAD21AoAMST1V1OE=yvg3vQn=p6DDJBAaaFDx39z=P-+Z9iXwOQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<CAD21AoAwX1D6baSGuQXm0mzPXPWB07kgaoaaahjNHHenbdY24A@mail.gmail.com>

On Wed, Mar 25, 2026 at 6:09 PM Masahiko Sawada <[email protected]> wrote:
>
> On Wed, Mar 25, 2026 at 5:35 PM Tom Lane <[email protected]> wrote:
> >
> > Tomas Vondra <[email protected]> writes:
> > > On 3/26/26 00:40, Tom Lane wrote:
> > >> I believe what's happening there is that in cs_CZ locale,
> > >> "V" doesn't follow simple ASCII sort ordering.
> >
> > > With cs_CZ all letters sort *before* numbers, while in en_US it's the
> > > other way around. V is not special in any way.
> >
> > Ah, sorry, I should have researched a bit instead of relying on
> > fading memory.  The quirk I was thinking of is that in cs_CZ,
> > "ch" sorts after "h":
> >
> > u8=# select 'h' < 'ch'::text collate "en_US";
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > u8=# select 'h' < 'ch'::text collate "cs_CZ";
> >  ?column?
> > ----------
> >  t
> > (1 row)
> >
> > Regular hex encoding isn't bitten by that because it doesn't
> > use 'h' in the text form ... but this base32hex thingie does.
> >
> > However, your point is also correct:
> >
> > u8=# select '0' < 'C'::text ;
> >  ?column?
> > ----------
> >  t
> > (1 row)
> >
> > u8=# select '0' < 'C'::text collate "cs_CZ";
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > and that breaks "text ordering matches numeric ordering"
> > for both traditional hex and base32hex.  So maybe this
> > is not as big a deal as I first thought.  We need a fix
> > for the new test though.  Probably adding COLLATE "C"
> > would be enough.
>
> Thank you for the report and the analysis.
>
> I've reproduced the issue with "cs_CZ" collation and adding COLLATE
> "C" to the query resolves it. It seems also a good idea to add a note
> in the documentation too as users might face the same issue. For
> example,
>
> To maintain the lexicographical sort order of the encoded data, ensure
> that the text is sorted using the C collation (e.g., using COLLATE
> "C"). Natural language collations may sort characters differently and
> break the ordering.
>

Attached the patch doing the above idea.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Attachments:

  [text/x-patch] 0001-Fix-UUID-sortability-tests-in-base32hex-encoding.patch (3.6K, 2-0001-Fix-UUID-sortability-tests-in-base32hex-encoding.patch)
  download | inline diff:
From a64f3f64a9f04c1f5da9a51fe760c40480585fd4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Wed, 25 Mar 2026 18:07:11 -0700
Subject: [PATCH] Fix UUID sortability tests in base32hex encoding.

The recently added test for base32hex encoding of UUIDs failed on
buildfarm member hippopotamus using natural language locales (such as
cs_CZ). This happened because those collations may sort characters
differently, which breaks the strict byte-wise lexicographical
ordering expected by base32hex encoding.

This commit fixes the regression tests by explicitly using the C
collation. Additionally, add a note to the documentation to warm users
that they must use the C collation if they want to maintain the
lexicographical sort order of the encoded data.

Per buildfarm member hippopotamus.

Analyzed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 doc/src/sgml/func/func-binarystring.sgml | 9 +++++++++
 src/test/regress/expected/uuid.out       | 7 +++++--
 src/test/regress/sql/uuid.sql            | 7 +++++--
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/func/func-binarystring.sgml b/doc/src/sgml/func/func-binarystring.sgml
index 0aaf9bc68f1..2ad2cdbea82 100644
--- a/doc/src/sgml/func/func-binarystring.sgml
+++ b/doc/src/sgml/func/func-binarystring.sgml
@@ -790,6 +790,15 @@
        produces a 26-character string compared to the standard 36-character
        UUID representation.
       </para>
+
+      <note>
+       <para>
+        To maintain the lexicographical sort order of the encoded data,
+        ensure that the text is sorted using the C collation
+        (e.g., using <literal>COLLATE "C"</literal>). Natural language
+        collations may sort characters differently and break the ordering.
+       </para>
+      </note>
      </listitem>
     </varlistentry>
 
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 142c529e693..9c5dda9e9ab 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -236,8 +236,11 @@ SELECT array_agg(id ORDER BY guid_field) FROM guid3;
  {1,2,3,4,5,6,7,8,9,10,11,12}
 (1 row)
 
--- make sure base32hex encoding works with UUIDs and preserves ordering
-SELECT array_agg(id ORDER BY guid_encoded) FROM guid3;
+-- Test base32hex encoding of UUIDs and its lexicographical sorting property.
+-- COLLATE "C" is required to prevent buildfarm failures in non-C locales
+-- where natural language collations (such as cs_CZ) would break strict
+-- byte-wise ordering.
+SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3;
           array_agg           
 ------------------------------
  {1,2,3,4,5,6,7,8,9,10,11,12}
diff --git a/src/test/regress/sql/uuid.sql b/src/test/regress/sql/uuid.sql
index f2ff00f5ddd..8cc2ad40614 100644
--- a/src/test/regress/sql/uuid.sql
+++ b/src/test/regress/sql/uuid.sql
@@ -122,8 +122,11 @@ INSERT INTO guid3 (guid_field) SELECT uuidv7() FROM generate_series(1, 10);
 INSERT INTO guid3 (guid_field) VALUES ('ffffffff-ffff-ffff-ffff-ffffffffffff'::uuid);
 SELECT array_agg(id ORDER BY guid_field) FROM guid3;
 
--- make sure base32hex encoding works with UUIDs and preserves ordering
-SELECT array_agg(id ORDER BY guid_encoded) FROM guid3;
+-- Test base32hex encoding of UUIDs and its lexicographical sorting property.
+-- COLLATE "C" is required to prevent buildfarm failures in non-C locales
+-- where natural language collations (such as cs_CZ) would break strict
+-- byte-wise ordering.
+SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3;
 
 -- Check the timestamp offsets for v7.
 --
-- 
2.53.0



view thread (63+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Add uuid_to_base32hex() and base32hex_to_uuid() built-in functions
  In-Reply-To: <CAD21AoC=92zAEmaGt+3U1jSD77jgdQT_HXUwPPtZEze2sJ=vWA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox