Re: utf8 vs UTF-8

public inbox for [email protected]  

4+ messages / 3 participants

* Re: utf8 vs UTF-8
@ 2024-05-17 13:51  Tom Lane <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Tom Lane @ 2024-05-17 13:51 UTC (permalink / raw)
  To: Troels Arvin <[email protected]>; +Cc: [email protected]

Troels Arvin <[email protected]> writes:
> In a Postgres installation, I have databases where the locale is 
> slightly different. Which one is correct? Excerpt from "psql --list":

>   test1       | loc_test | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8
>   test3       | troels   | UTF8     | libc            | en_US.utf8  | en_US.utf8

On most if not all platforms, both those spellings of the locale names
will be taken as valid.  You might try running "locale -a" to get an
idea of which one is preferred according to your current libc
installation ... but TBH, I doubt it's worth worrying about.

			regards, tom lane

* Re: utf8 vs UTF-8
@ 2024-05-18 14:48  Troels Arvin <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Troels Arvin @ 2024-05-18 14:48 UTC (permalink / raw)
  To: [email protected]; +Cc: Tom Lane <[email protected]>

Hello,

Tom Lane wrote:
 >>  test1  | loc_test | UTF8   | libc     | en_US.UTF-8 | en_US.UTF-8
 >>  test3  | troels   | UTF8   | libc     | en_US.utf8  | en_US.utf8
 >
 > On most if not all platforms, both those spellings of the locale names
 > will be taken as valid.  You might try running "locale -a" to get an
 > idea of which one is preferred according to your current libc
 > installation

"locale -a" on the Ubuntu system outputs this:

   C
   C.utf8
   en_US.utf8
   POSIX

On a CentOS7 system, it's sort-of the same:

   locale -a | grep -i en_us
   en_US
   en_US.iso88591
   en_US.iso885915
   en_US.utf8

So at first, I thought en_US.utf8 would be the most correct locale 
identifier. However, when I look at Postgres' own databases, they have 
the slightly different locale string:

   psql --list | grep -E 'postgres|template'
   postgres  | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
   template0 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
   template1 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...

Also, when I try to create a database with "en_US.utf8" as locale 
without specifying a template:

troels=# create database test4 locale 'en_US.utf8';
ERROR:  new collation (en_US.utf8) is incompatible with the collation of 
the template database (en_US.UTF-8)
HINT:  Use the same collation as in the template database, or use 
template0 as template.

Given the locale of Postgres' own databases and Postgres' error message, 
I'm leaning to en_US.UTF-8 being the most correct locale to use. Because 
why would Postgres care about it, if utf8/UTF-8 doesn't matter?

> but TBH, I doubt it's worth worrying about.

But couldn't there be an issue, if for example the client's locale and 
the server's locale aren't exactly the same? I'm thinking maybe the 
client library has to perform unneeded translation of the stream of data 
to/from the database?

-- 
Kind regards,
Troels


* Re: utf8 vs UTF-8
@ 2024-05-18 15:01  Adrian Klaver <[email protected]>
  parent: Troels Arvin <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Adrian Klaver @ 2024-05-18 15:01 UTC (permalink / raw)
  To: Troels Arvin <[email protected]>; [email protected]; +Cc: Tom Lane <[email protected]>

On 5/18/24 07:48, Troels Arvin wrote:
> Hello,
> 
> Tom Lane wrote:
>  >>  test1  | loc_test | UTF8   | libc     | en_US.UTF-8 | en_US.UTF-8
>  >>  test3  | troels   | UTF8   | libc     | en_US.utf8  | en_US.utf8
>  >
>  > On most if not all platforms, both those spellings of the locale names
>  > will be taken as valid.  You might try running "locale -a" to get an
>  > idea of which one is preferred according to your current libc
>  > installation
> 
> "locale -a" on the Ubuntu system outputs this:
> 
>    C
>    C.utf8
>    en_US.utf8
>    POSIX

If you expand that to locale -v -a you get:

locale: en_US.utf8      archive: /usr/lib/locale/locale-archive
-------------------------------------------------------------------------------
     title | English locale for the USA
    source | Free Software Foundation, Inc.
   address | https://www.gnu.org/software/libc/
     email | [email protected]
  language | American English
 territory | United States
  revision | 1.0
      date | 2000-06-24
   codeset | UTF-8



> So at first, I thought en_US.utf8 would be the most correct locale 
> identifier. However, when I look at Postgres' own databases, they have 
> the slightly different locale string:
> 
>    psql --list | grep -E 'postgres|template'
>    postgres  | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
>    template0 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
>    template1 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
> 
> Also, when I try to create a database with "en_US.utf8" as locale 
> without specifying a template:
> 
> troels=# create database test4 locale 'en_US.utf8';
> ERROR:  new collation (en_US.utf8) is incompatible with the collation of 
> the template database (en_US.UTF-8)
> HINT:  Use the same collation as in the template database, or use 
> template0 as template.

I'm going to say that is Postgres being exact to a fault.

> 
> Given the locale of Postgres' own databases and Postgres' error message, 
> I'm leaning to en_US.UTF-8 being the most correct locale to use. Because 
> why would Postgres care about it, if utf8/UTF-8 doesn't matter?
> 
> 
>> but TBH, I doubt it's worth worrying about.
> 
> But couldn't there be an issue, if for example the client's locale and 
> the server's locale aren't exactly the same? I'm thinking maybe the 
> client library has to perform unneeded translation of the stream of data 
> to/from the database?



-- 
Adrian Klaver
[email protected]

* Re: utf8 vs UTF-8
@ 2024-05-18 15:23  Tom Lane <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Tom Lane @ 2024-05-18 15:23 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: Troels Arvin <[email protected]>; [email protected]

Adrian Klaver <[email protected]> writes:
> On 5/18/24 07:48, Troels Arvin wrote:
>> Also, when I try to create a database with "en_US.utf8" as locale 
>> without specifying a template:
>> 
>> troels=# create database test4 locale 'en_US.utf8';
>> ERROR:  new collation (en_US.utf8) is incompatible with the collation of 
>> the template database (en_US.UTF-8)
>> HINT:  Use the same collation as in the template database, or use 
>> template0 as template.

> I'm going to say that is Postgres being exact to a fault.

Yeah.  glibc will treat those two locale names as equivalent,
and I think most if not all other libc implementations do too.
But Postgres doesn't know that so it demands exact textual
equality before assuming two locale names are equivalent.
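
To make the distinction concrete, here is a minimal Python sketch of the two
comparison strategies. It is not code from glibc or Postgres. The function
names (normalize_codeset, split_locale, same_locale) are made up for
illustration, and the normalization rule (drop non-alphanumerics, lowercase,
prefix "iso" when only digits remain) is an assumption based on how glibc
appears to behave, which also matches the names shown by "locale -a" above.

```python
def normalize_codeset(codeset: str) -> str:
    """Approximate a libc-style codeset normalization (assumed rule)."""
    # Keep only ASCII letters and digits, then lowercase: "UTF-8" -> "utf8".
    name = "".join(c for c in codeset if c.isascii() and c.isalnum()).lower()
    # A codeset made up of digits only gets an "iso" prefix.
    if name.isdigit():
        name = "iso" + name
    return name


def split_locale(name: str):
    """Split 'lang_TERRITORY.codeset@modifier' into its three parts."""
    base, _, modifier = name.partition("@")
    base, _, codeset = base.partition(".")
    return base, codeset, modifier


def same_locale(a: str, b: str) -> bool:
    """Compare two locale names after normalizing the codeset part."""
    base_a, codeset_a, mod_a = split_locale(a)
    base_b, codeset_b, mod_b = split_locale(b)
    return (base_a, normalize_codeset(codeset_a), mod_a) == (
        base_b, normalize_codeset(codeset_b), mod_b)


if __name__ == "__main__":
    a, b = "en_US.UTF-8", "en_US.utf8"
    print("exact textual match:", a == b)
    print("match after normalization:", same_locale(a, b))
```

The plain string comparison corresponds to the check that rejects
'en_US.utf8' against a template using 'en_US.UTF-8'; the normalized
comparison corresponds to the libc view in which both names select the same
locale.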

If this is getting in your way you could probably get away with
just UPDATE-ing pg_database to use whichever spelling you think is
preferable; the strings appearing in datcollate and datctype aren't
stored anywhere else.  (But experiment in a scratch installation to
verify that ... and don't try changing them to something that you
don't know to be semantically equivalent.)
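
For a scratch installation, the statement in question might look like what
the following Python sketch prints. The helper names (sql_literal,
relabel_statement) are hypothetical, the database name test3 and the target
spelling en_US.UTF-8 are taken from the listing earlier in the thread, and
the quoting is only adequate for illustration. Review the output before
pasting it into psql as a superuser, and only use a spelling known to be
equivalent to the current one.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes.

    Illustration only: assumes standard_conforming_strings is on.
    """
    return "'" + value.replace("'", "''") + "'"


def relabel_statement(datname: str, locale_name: str) -> str:
    """Build an UPDATE that rewrites datcollate and datctype for one database."""
    loc = sql_literal(locale_name)
    return (
        "UPDATE pg_database "
        f"SET datcollate = {loc}, datctype = {loc} "
        f"WHERE datname = {sql_literal(datname)};"
    )


if __name__ == "__main__":
    # Hypothetical example: give test3 the same spelling as the templates.
    print(relabel_statement("test3", "en_US.UTF-8"))
```

Running "psql --list" before and after is a simple way to confirm that only
the intended row changed.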

			regards, tom lane
