From: Jeff Davis <[email protected]>
To: [email protected]
Subject: Re: Change initdb default to the builtin collation provider
Date: Fri, 31 Oct 2025 14:30:19 -0700
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>

On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:
> -------
> Summary
> -------
> 
> The libc collation provider is a bad default[1]. The builtin collation
> provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.
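
For illustration, the two rules can be modeled as a small standalone C
function. This is only a sketch, not the initdb code: choose_defaults(),
the Provider and Encoding enums, and the Defaults struct are hypothetical
names, and it assumes that no encoding was given on the command line and
that locale_enc is whatever encoding was derived from the locale.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for initdb's provider and encoding constants. */
typedef enum { PROVIDER_LIBC, PROVIDER_BUILTIN, PROVIDER_ICU } Provider;
typedef enum { ENC_SQL_ASCII, ENC_UTF8, ENC_OTHER } Encoding;

typedef struct
{
	Provider	provider;
	Encoding	encoding;
	const char *datlocale;		/* NULL when only the libc locale names apply */
} Defaults;

/*
 * Hypothetical helper modeling the proposed defaults.  "provider" is the
 * provider in effect before the rules run (libc unless the user chose one).
 */
static Defaults
choose_defaults(const char *lc_ctype, const char *lc_collate,
				bool provider_specified, Provider provider,
				Encoding locale_enc)
{
	Defaults	d = {provider, locale_enc, NULL};

	assert(lc_ctype != NULL && lc_collate != NULL);

	/* 0001: SQL_ASCII is compatible with any encoding, so prefer UTF-8. */
	if (d.encoding == ENC_SQL_ASCII)
		d.encoding = ENC_UTF8;

	/* 0002: override libc only if the user did not choose a provider. */
	if (!provider_specified && provider == PROVIDER_LIBC &&
		strcmp(lc_ctype, lc_collate) == 0)
	{
		if (strcmp(lc_ctype, "C") == 0)
		{
			d.provider = PROVIDER_BUILTIN;
			d.datlocale = "C";
		}
		else if (strcmp(lc_ctype, "C.UTF-8") == 0 ||
				 strcmp(lc_ctype, "C.UTF8") == 0)
		{
			d.provider = PROVIDER_BUILTIN;
			d.datlocale = "C.UTF-8";
		}
	}

	return d;
}
```

In this model, an explicit provider choice is never overridden, and a
mismatch between LC_CTYPE and LC_COLLATE leaves the provider as libc.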

Motivation:

* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
either.

* Whether the "C" locale uses the builtin provider or the libc provider
is mostly about the catalog representation, because the implementation
is the same. I don't have a strong motivation for this change; it just
clarifies that libc is not actually being used when the locale is "C".

* I think most users of the "C.UTF-8" locale would be better off with
the builtin provider, which benefits from important optimizations.
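
To make the point about the "C" locale concrete: comparison in that locale
is plain byte order, which needs no call into libc's strcoll(). A minimal
sketch of such a comparison follows; c_locale_compare() is a hypothetical
name used only for illustration, and this is not the code from the
PostgreSQL source tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: compare two byte strings the way a "C" collation
 * does.  Bytes are compared as unsigned values; if one string is a prefix
 * of the other, the shorter one sorts first.
 */
static int
c_locale_compare(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t		minlen = (alen < blen) ? alen : blen;
	int			result;

	assert(a != NULL && b != NULL);

	result = memcmp(a, b, minlen);
	if (result == 0 && alen != blen)
		result = (alen < blen) ? -1 : 1;

	return result;
}
```

Because the result depends only on the bytes, it is the same whichever
provider the catalog names.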

Note:

This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it didn't ever really
use libc internally). I'm not sure if others want this behavior or if
it would be surprising.

Regards,
	Jeff Davis



Attachments:

  [text/x-patch] v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patch (1.0K, 2-v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patch)
From 9c8cf58c541462a6aef43fed0ddea1e9f1633960 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 31 Oct 2025 13:36:46 -0700
Subject: [PATCH v1 1/2] initdb: prefer UTF-8 encoding over SQL_ASCII.

This was already true for the ICU locale provider; make it true for
the others.
---
 src/bin/initdb/initdb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..aa7fc5a6636 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2718,10 +2718,10 @@ setup_locale_encoding(void)
 		ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
 
 		/*
-		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. ICU does
-		 * not support SQL_ASCII, so select UTF-8 instead.
+		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. Prefer
+		 * UTF-8.
 		 */
-		if (locale_provider == COLLPROVIDER_ICU && ctype_enc == PG_SQL_ASCII)
+		if (ctype_enc == PG_SQL_ASCII)
 			ctype_enc = PG_UTF8;
 
 		if (ctype_enc == -1)
-- 
2.43.0



  [text/x-patch] v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patch (2.4K, 3-v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patch)
From 8b1659fab50396eaeacab042aeaef8df241af467 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 31 Oct 2025 14:05:10 -0700
Subject: [PATCH v1 2/2] initdb: if locale is C or C.UTF-8, use builtin
 provider.

If the provider is unspecified and the locale is C or C.UTF-8, use the
builtin provider. If the provider is specified, then do not override it.

The C locale has always been, effectively, the builtin provider, in
the sense that it uses built-in logic rather than strcoll(), etc. The
change here is mostly about the catalog representation.

The C.UTF-8 locale has used libc, but by doing so, collation doesn't
benefit from important performance optimizations. Now that we have a
builtin "C.UTF-8" collation which does benefit from those
optimizations, use that.
---
 src/bin/initdb/initdb.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index aa7fc5a6636..84931f145f4 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -145,6 +145,7 @@ static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
 static char locale_provider = COLLPROVIDER_LIBC;
+static bool locale_provider_specified = false;
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2465,6 +2466,28 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
+	/*
+	 * If the locale is C or C.UTF-8, and no provider was specified, use the
+	 * builtin provider rather than libc.
+	 */
+	if (!locale_provider_specified && locale_provider == COLLPROVIDER_LIBC)
+	{
+		if (strcmp(lc_ctype, lc_collate) == 0)
+		{
+			if (strcmp(lc_ctype, "C") == 0)
+			{
+				locale_provider = COLLPROVIDER_BUILTIN;
+				datlocale = "C";
+			}
+			else if (strcmp(lc_ctype, "C.UTF-8") == 0 ||
+					 strcmp(lc_ctype, "C.UTF8") == 0)
+			{
+				locale_provider = COLLPROVIDER_BUILTIN;
+				datlocale = "C.UTF-8";
+			}
+		}
+	}
+
 	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
 		pg_fatal("locale must be specified if provider is %s",
 				 collprovider_name(locale_provider));
@@ -3362,6 +3385,8 @@ main(int argc, char *argv[])
 										 "-c debug_discard_caches=1");
 				break;
 			case 15:
+				locale_provider_specified = true;
+
 				if (strcmp(optarg, "builtin") == 0)
 					locale_provider = COLLPROVIDER_BUILTIN;
 				else if (strcmp(optarg, "icu") == 0)
-- 
2.43.0


