Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v7Nmi-001slK-1p for pgsql-hackers@arkaria.postgresql.org; Sat, 11 Oct 2025 00:48:32 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1v7Nme-00EpcV-Tu for pgsql-hackers@arkaria.postgresql.org; Sat, 11 Oct 2025 00:48:29 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v7Nme-00EpcM-Cb for pgsql-hackers@lists.postgresql.org; Sat, 11 Oct 2025 00:48:29 +0000 Received: from mail-pg1-x52a.google.com ([2607:f8b0:4864:20::52a]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1v7NmQ-0017f7-05 for pgsql-hackers@postgresql.org; Sat, 11 Oct 2025 00:48:28 +0000 Received: by mail-pg1-x52a.google.com with SMTP id 41be03b00d2f7-b62ed9c3e79so1649274a12.0 for ; Fri, 10 Oct 2025 17:48:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=j-davis-com.20230601.gappssmtp.com; s=20230601; t=1760143692; x=1760748492; darn=postgresql.org; h=mime-version:user-agent:content-transfer-encoding:date:to:from :subject:message-id:from:to:cc:subject:date:message-id:reply-to; bh=NeYeRtyVRMQe1k4LE6H5ElBYyFhPnzyUYgArM478Mzo=; b=xyEvY2tFm21qk5NPE8uGXj3K0Sj8gX7MN7ZdqCSN3qCB6oeF9ZUk91SHZSj0oHwv4Y k73GzOlqBB/IALByfLc5QePC0foMoPSXANfPd8jk+VaCqE49sBGP8knjUoQShrgCffK+ XH8uF0Fn0zptThZT8Tx2ESck1D5GkftvJ32f0EMWguNUacFKfWTdKPmGL8c5OuvAdMQs tMSe0wgibO8CLRXmzm8i+g9YTVwvn+ZgU7zKiOdgmAQ0Yv+jI+qMWiBU/xYeiM6FVmru LWm0gcmaUIGxsPVDtLSR1dS+FVwmKwjab0bfpjvyiKg5FML4bCjWu0ZlmeNuMSW9LQeB EO5g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760143692; x=1760748492; h=mime-version:user-agent:content-transfer-encoding:date:to:from :subject:message-id:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=NeYeRtyVRMQe1k4LE6H5ElBYyFhPnzyUYgArM478Mzo=; b=P/iucYb3Hm0o7bRafkrQ8P55Wy7WQhXxg1HXmXZaSao1EVJJpB56uiHqIPb1Hi/3+0 LCBFbDdhbp9t7/pP1VQtQ0Op/zG4/+1A/2l1jpnO2XrnAl4AzBrAMACt/jKlPCsOtvrr EUHqh2Wozkb7QMipUJ9t3noYPH+ZNoxCBbdsNby5FDKBOHfnUVGbRS6ChsOSf71rNctN OE42t7Hq8NcSy3nRAem0bvqctYrpKkVz6NPZN/4c33417qWRZdiUABGmIMBa2jefpQv7 Z6+uRqPr5FQ0mQ8MIYojPh+XH7vL0qyMFKAxI66c6EmWkOZ5jsCpNXTq0Xvlztujw3TX tlBA== X-Gm-Message-State: AOJu0Yz+L/b9CozVPbLSKgaP84hPWfFN4HLlONsDDbTONk9y/9BMG3BG m2+z+2uRi9zEEBwKzCBOVL1/SpCO4r4+6MhQblJyfk7IFljWJiuOpllvNchkaK2nKL3UlWPUk8s Fc7Q= X-Gm-Gg: ASbGncuqlTfofzc/gjHvrEotwph2MsiEputflqeoeq5q4XhQ06784YwN+iKw+JZOesD M1nVTma36Z5hnKP3O340/O29jAhoYhjP+U7d1DJCOUiAJDQpESLkfFw4jz2B3ZkyLqjvuUU4Y0Y 4Cz+BVwES6mZ4R8ZY56jz81C0yiQmWQgEtKhcHcZpFezVqebuTaeznK4awTqkDsRe41Y3jJ2HEl 8qyX5nH526g0FWlJa6bkOa3vpDtB4YtP4HTy5XCtiAMMATAvokgC8SVUzlGx9frfqtx/eoe6Man 7BEvfx1IkiN11LJHB5g1HwXCyJNrIez4Sq93CBn9XAxqFA6KjOg8HYWM75JicsxiDnc1Np+aTKm RYbTAVxx0FJeI9u6WyZHbHles01pMCv9bI7JR1ECTU4DU26o69ALHVkAlB0UbRwKnJs10cWEaC4 alBvWTHfPB X-Google-Smtp-Source: AGHT+IHK3Lj8IemHABHz/ppOpKqiZ5ah73gEgtYLTBHt2Z1A36B/lLIk2OQnNLES3dRFq52wzQoSZg== X-Received: by 2002:a05:6a21:3383:b0:248:e0f7:1331 with SMTP id adf61e73a8af0-32da80bb93bmr18195558637.3.1760143691947; Fri, 10 Oct 2025 17:48:11 -0700 (PDT) Received: from jeff-ws-bridge.lan (c-24-7-19-3.hsd1.ca.comcast.net. [24.7.19.3]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-7992d09651csm4261549b3a.48.2025.10.10.17.48.11 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 10 Oct 2025 17:48:11 -0700 (PDT) Message-ID: Subject: Change initdb default to the builtin collation provider From: Jeff Davis To: pgsql-hackers@postgresql.org Date: Fri, 10 Oct 2025 17:48:10 -0700 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.52.3-0ubuntu1 MIME-Version: 1.0 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk ------- Summary ------- The libc collation provider is a bad default[1]. The builtin collation provider is a good default, so let's use that. ---------- Motivation ---------- The initdb default is what we choose for new postgres instances when we have little information about the user's collation needs. The default has no effect on upgrades, which always use the previous instance's locale settings. There are a number of factors to consider when choosing a default: * Risk of primary key inconsistencies due to libc or ICU updates * Performance * Quality of query semantics for a variety of scripts, languages and regions (excluding the final result order) * Final result ordering / display In the absence of specific user requirements, these factors weigh heavily in favor of the builtin collation provider, and heavily against libc. With the builtin provider, there's no risk of primary key or plain index inconsistencies, the performance is great (ordering with memcmp()), and the query semantics are based on Unicode. ------------------- Why does it matter? ------------------- Arguably, we could just not worry and let various service providers, tools, scripts, packages, and wrappers make the choice independently. But that just moves the problem -- someone still needs to make that choice. Collectively, we have built up some knowledge about collation here on -hackers, and it would be good to offer it as guidance. And an initdb default is a good way to offer that guidance. Unifying around one default also creates a more consistent, tested, and documented Postgres experience that benefits hackers and users alike. -------------------------------- What's the catch? Display order. -------------------------------- The builtin provider uses code point order, i.e. memcmp(), so the final result display order is less human-friendly. For instance, 'Z' comes before 'a'. That problem is annoying, but *much* easier to fix than the other factors. The user might add a COLLATE clause to the final ORDER BY, or perform the sort in the application layer or presentation layer. Other providers offer a better final display order, but it comes at a heavy price: index inconsistencies and poor performance. Those problems are hard to address in an existing system. Some users may be willing to pay that price, but it should be opt-in. Furthermore, in the default case, we don't even really know which language and region to use. We infer it from the environment variable LC_COLLATE at initdb time, but that's a weak signal: there's little reason to think that the OS admin, DBA, and end user are all in the same locale. In general, there's little reason to think that a single locale for display order is enough for a whole database. Often, databases are used (directly or indirectly) by people from dozens of locales. When per-locale display order becomes an issue, it will be necessary to add COLLATE clauses or application logic to tailor to the end user regardless, so the database default locale won't be useful. For all of these reasons, display order is the wrong thing to optimize for when the user doesn't specify anything. We should prioritize the other factors, and for those other factors, the builtin provider is the best. ------------ Why not ICU? ------------ ICU is better than libc in a lot of ways: * Better performance * Platform-independent * Easier to manage it as a separate library But fundamentally, I don't think it's a great default, because it favors final result display order at the risk of primary key inconsistencies. ------------------ Other Alternatives ------------------ In a previous thread[1], I laid out some alternatives. If someone disagrees with this proposal, please choose one of those or suggest a new one. The most interesting alternative, in my opinion, is #4, but that was soundly rejected. --------------------- Which builtin locale? --------------------- All builtin locales use the exact same ordering: they sort by code point. Code point order is stable, so primary keys and plain indexes remain consistent across upgrades forever. The difference is in case conversion and character classification semantics: 1. C: Only basic ASCII semantics which never change. 2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this is similar to the libc "C.UTF-8" locale available on Linux. It's also similar to the default semantics of at least one big commercial database, making migrations easier. 3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more aligned with the SQL standard, which specifies in an example the uppercase of '=C3=9F' is 'SS'. For the latter two locales, expression and partial indexes depending on these semantics may be subject to inconsistencies after a Unicode update. I propose changing the default to PG_C_UTF8 because it seems simple and practical. However, I'm also fine with PG_UNICODE_FAST if those affected by the "full" case mapping find it helpful. "C" is also a possibility, but the query semantics suffer. All are better than libc. ------- Details ------- The mechanics of the default itself are being worked out here[2]. The concrete proposal here is to commit those patches, and then change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and DEFAULT_BUILTIN_LOCALE to whatever we choose here. Note: the builtin provider requires UTF-8, which can potentially conflict with the LC_CTYPE. Fortunately, when the builtin provider is being used, LC_CTPE has little effect. To further reduce the consequences of LC_CTYPE when using the builtin provider, another patch[3] fixes tsearch to parse based on the database default locale rather than depending on LC_CTYPE. Comments welcome. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d023594= 7e1.camel@j-davis.com [2] https://www.postgresql.org/message-id/7d424dc0b032b30a22220634d12377bf59524= bdb.camel@j-davis.com [3] https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622= ff7.camel@j-davis.com