From: Michael Downey <[email protected]>
To: Tom Lane <[email protected]>
Cc: [email protected] <[email protected]>
Subject: RE: PostgreSQL implicitly double-quoting identifier name with umlaut
Date: Fri, 13 Sep 2024 23:33:52 +0000
Message-ID: <BY3PR05MB798521235A8EAB02E2BDD885DB652@BY3PR05MB7985.namprd05.prod.outlook.com>
In-Reply-To: <[email protected]>
References: <BY3PR05MB7985A889E17333E35AF6B110DB652@BY3PR05MB7985.namprd05.prod.outlook.com>
	<[email protected]>

Thanks Tom!

I doubt we would want this changed. Our documentation and dev teams are on
board with the current lower-case folding, and I really would not want it
changed. We will make sure to note this behavior in our own documentation
as well.

Thank you
Michael

-----Original Message-----
From: Tom Lane <[email protected]> 
Sent: Friday, September 13, 2024 4:20 PM
To: Michael Downey <[email protected]>
Cc: [email protected]
Subject: Re: PostgreSQL implicitly double-quoting identifier name with umlaut

Michael Downey <[email protected]> writes:
> One of our internal users, using our tools, added a column called Örtschaft.
> We anticipated it would be folded to lower case, so we recorded the column
> in our metadata under the name örtschaft. When we ran the system metadata
> query, we saw mismatches involving this column, because the actual column
> name in the database is Örtschaft.

When working in UTF8 (or any multibyte encoding), PG's identifier
case-folding changes only ASCII letters.  I can't find anything in our SGML
docs about this, at least not where I'd expect it to be documented.  The
code is pretty clear about what it's doing though:

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't yet
     * have the infrastructure for.  Instead we use tolower() to provide a
     * locale-aware translation.  However, there are some locales where this
     * is not right either (eg, Turkish may do strange things with 'i' and
     * 'I').  Our current compromise is to use tolower() for characters with
     * the high bit set, as long as they aren't part of a multi-byte
     * character, and use an ASCII-only downcasing for 7-bit characters.
     */

These days the claim that no infrastructure is available is obsolete.
But I'm mighty hesitant to touch this behavior, because it'd almost surely
break people's apps.  We could do better on the documentation front though.

			regards, tom lane
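
The mismatch described in this thread can be reproduced outside the database with a small model of the folding rule. The sketch below is only an approximation of the multibyte-encoding case Tom describes (ASCII letters are lowered, everything else is left alone); `fold_ascii_only` is a hypothetical helper written for illustration, not PostgreSQL code, and it does not model the single-byte `tolower()` branch mentioned in the source comment.

```python
def fold_ascii_only(identifier: str) -> str:
    """Lower-case ASCII A-Z only; leave every other character as it is.

    Hypothetical helper modeling the multibyte-encoding case described
    above. It is not PostgreSQL's actual implementation.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in identifier
    )


column = "Örtschaft"
stored = fold_ascii_only(column)   # name the server would keep under this model
guessed = column.lower()           # Unicode-aware lowering, as the tool assumed

print(stored)             # Örtschaft
print(guessed)            # örtschaft
print(stored == guessed)  # False
```

Under this model, a tool that needs to predict the stored name of an unquoted identifier should apply the same ASCII-only folding rather than a Unicode-aware `lower()`, or should read the name back from the catalog after creating the column.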

