From: Tom Lane
To: Michael Downey
cc: "pgsql-sql@lists.postgresql.org"
Subject: Re: PostgreSQL implicitly double-quoting identifier name with umlaut
Date: Fri, 13 Sep 2024 19:20:16 -0400
Message-ID: <156004.1726269616@sss.pgh.pa.us>

Michael Downey writes:
> One of our internal users, using our tools, added a column called Örtschaft. We anticipated it would be folded to lower case.
> So we inserted our metadata for the column in our metadata with the name örtschaft. With the system query for metadata, we
> ended up seeing query mismatches involving this column as we found the actual column name is Örtschaft
> in the database.

When working in UTF8 (or any multibyte encoding), PG's identifier
case-folding changes only ASCII letters.  I can't find anything in our SGML
docs about this, at least not where I'd expect it to be documented.  The
code is pretty clear about what it's doing though:

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't yet
     * have the infrastructure for.  Instead we use tolower() to provide a
     * locale-aware translation.  However, there are some locales where this
     * is not right either (eg, Turkish may do strange things with 'i' and
     * 'I').  Our current compromise is to use tolower() for characters with
     * the high bit set, as long as they aren't part of a multi-byte
     * character, and use an ASCII-only downcasing for 7-bit characters.
     */

These days the claim that no infrastructure is available is obsolete.
But I'm mighty hesitant to touch this behavior, because it'd almost
surely break people's apps.  We could do better on the documentation
front though.

			regards, tom lane
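
To illustrate the multibyte case described above, here is a minimal Python sketch of "downcase only the ASCII letters". It is not PostgreSQL source code: the function name fold_identifier_utf8 is made up for this example, and the sketch assumes a UTF-8 database encoding, so it ignores the tolower() branch that the comment describes for high-bit characters in single-byte encodings.

```python
def fold_identifier_utf8(ident: str) -> str:
    """Downcase only ASCII A-Z; leave every other character untouched.

    Hypothetical helper that mimics the behavior described for multibyte
    encodings. It is an illustration, not the server's implementation.
    """
    raw = ident.encode("utf-8")
    folded = bytearray()
    for byte in raw:
        # 7-bit uppercase letters get the plain ASCII offset applied.
        if ord("A") <= byte <= ord("Z"):
            byte += ord("a") - ord("A")
        # Bytes with the high bit set belong to multibyte characters
        # in UTF-8 and are copied through unchanged.
        folded.append(byte)
    return folded.decode("utf-8")


if __name__ == "__main__":
    for name in ("FooBar", "Örtschaft", "ÖRTSCHAFT"):
        print(name, "->", fold_identifier_utf8(name))
```

Under this rule the leading Ö survives folding while the ASCII letters after it are lowercased, so metadata stored as örtschaft (what a Unicode-aware lower() would produce) will not match the folded column name.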