Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0mB2-002CNQ-2v for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Mar 2026 19:58:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w0mAz-00HI6z-14 for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Mar 2026 19:58:33 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0mAy-00HI6q-2n for pgsql-hackers@lists.postgresql.org; Thu, 12 Mar 2026 19:58:33 +0000 Received: from mail-ed1-x52c.google.com ([2a00:1450:4864:20::52c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w0mAx-00000001p0j-1CaM for pgsql-hackers@postgresql.org; Thu, 12 Mar 2026 19:58:32 +0000 Received: by mail-ed1-x52c.google.com with SMTP id 4fb4d7f45d1cf-6615c766e60so2377144a12.3 for ; Thu, 12 Mar 2026 12:58:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1773345509; cv=none; d=google.com; s=arc-20240605; b=XTk2g+IxZR7SMGnkuxoByNbOTlBrblWaACXNPDOl8EpZGRaRb8kqjGdZ7NdPSq/eWy vyzuuDrO3ZN4qUATUgXDF1UmV1bnQ8xsbd52W0KtrYI0WgwqlRtXVcyl/4QYDUq/KJCK WY+v3S03CmT/88ke+2fYtvV/CJG08ZR6JFI615kt9J/xnzZ24deg15kFW3aCHXRPrOiB SPRH+c3rPBHkKsNBhQQRUzkK9OgeNZi0pT3Ki+GOEzGg3WtMo+FFCgnKF2qvqaaLTdrm N6RVRtrPjF63D9hVtO7LJOVyYVCnEhyWr7jnoGTaF+4vXekY28+h6YJVs6/2JGtXRgGt I04w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=nh8g8Znha4b+oCzrIH0gCvUzhd142fziMe0W7ljYGmE=; fh=2zSQZfKdmhwzAOqZC07nBJoHnw6XGaE3I4pRylMnrzM=; b=eClfnoT8/TXoQMDMGP/bNxpRXSxhZik+7CBT35qvO5pWrFvnIQDvKPsmowbEUY6820 Iq2ksCX5Z8rjjQJyVSxlFy0qslMpXXFFytLseL5+XWO3ilf4wF6JM1e3yPDme32Q29K5 vdwVuZYFOVf/bEGPtVCy4J5ce8CzfMSXJYRAkmx82GfpQ54bimmibbokKHwOdfBgd4f/ 9PQjKa9DNgK5Gv/qEYWFIqXaKdazGlkQlhg7V7gnjKDIjyvwjdHA9mpfy4aKPSLyNIvm MXb/2guS0cjar4GTSuYapJcANAGIBcaopeb40fyO/7Mb+CVQy5Uw7FuAYO+Bd2ljVFIF w3lA==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1773345509; x=1773950309; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=nh8g8Znha4b+oCzrIH0gCvUzhd142fziMe0W7ljYGmE=; b=dwsyXrXNm8L79fwAk/Ri4Gue5+T3fgylYiVny4bCMgW8lEFW/+N+dK2t3tNzEw7b9a Rqpg3u3eJztP8zpL5y5kbkiwZnH+xF32svMyU3lNgPMBrHvvzeLjOtW/zoFI2iILm7Ix NkAoD/fSl70CXg3KjPyou4hdm/9t/W36Cy6ceDY0msURwq/i4YLbjVsQodK4mOvLC5Vo TmS+2Gnkl5pvzizl1fVoo+szwb/icPuUTDX6oFpVEulb/NaXM64WSDPlhyW6FZrL3xE3 u9cjlHlTjM7Vpma7apToiF8llJ2c5NdLTa+ihHjnfwb3m4Nu/m+Xy3kcfBhEC+XMLVmX AhMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1773345509; x=1773950309; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=nh8g8Znha4b+oCzrIH0gCvUzhd142fziMe0W7ljYGmE=; b=RcdQ8cgzuwcGYno8w7aAp09QfhtmFhTPDx+qcjt8qmbZmwbb9LP+nUdTXJ2d/ON7Lb 9XaOiftnhlJto8x/HOpwbZSQL0/UIBPkLQJtbhKGAuW7QSPEDjyF8Ffzuix+AB9LHg+h oFJMS+QAyTaDZngMtSRrrGc6072T1qwOAiu6DIb/x4RxNr1LL0U5XK0jspUMd6OpEjjS kx1/Y+gCkWioN2JThNndYwYNlLs8SNRhtsnYoIuEc1A3c49Ey9qsvfWCIS23pYMKcCW/ N93+AXtu1M7IxRQxvODrGBbyg38+VPxZWqBAIBZQOGa4rt+RabeRyKre4ucTAEmoKXNZ 5EtQ== X-Gm-Message-State: AOJu0Yx6xWxlUa2GCMOFLNxkFOtvkTQskOuB0le5aQpxjZ/ziiPMKjNL nxqlruhzI6epUWGTtsdnOWf7PvW5z8cbb7m117+0V/Gxi0Lkblvc6ziGUIv2CsidJoVHSHmvUZY SPBLqz7kQcfUW0BNN0Wm+FZWXPLYxi46MZg== X-Gm-Gg: ATEYQzyrjOW8zviEWboO+9M3PE+wu32Gg8DgB83LrnENZA6RJf4h5hRRY4TgZBCaoc+ OdkLUyzRgvlSSgtZLyKs3jWYpIcYNgZYw3GTpFOt79Em5F8flvPwyk9RoZinPvmh//5ybejBz2O z9lh6JcYHaYRPNZU3KIwlsAWeY7iUotr6aHci8Ri6+UTTt9q0fIqnoO5OM9YcgDFYyIFRvlr9gu R2WnmHydNQ7D3oiqGg11fqLeHherF/Zhx5VWla1MuqYaffYqSdjsO9IweI3UVSalzzVvEbx6nJi WlXAXk4UFguHhglaR+JYaD8D3yQhT3tagqkS2N9W X-Received: by 2002:a17:906:1148:b0:b94:898:7bd2 with SMTP id a640c23a62f3a-b976510108bmr29362966b.39.1773345509213; Thu, 12 Mar 2026 12:58:29 -0700 (PDT) MIME-Version: 1.0 References: <47e1b4f72fe732c5ae85c6cf2c085b4e99a10120.camel@j-davis.com> <4309879ac305b1cf6b4d7b5fb85bc7b62c6ab768.camel@j-davis.com> In-Reply-To: From: Robert Haas Date: Thu, 12 Mar 2026 15:58:17 -0400 X-Gm-Features: AaiRm53DVt7F1VWY_4FStTiJCJY46FYLtax_SxGFq8bP3RsNIweg0JTc9o9Q7k0 Message-ID: Subject: Re: Change initdb default to the builtin collation provider To: Jeff Davis Cc: pgsql-hackers@postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Mar 12, 2026 at 3:20=E2=80=AFPM Jeff Davis wrot= e: > pg_upgrade should copy the locale settings to the new cluster as of > 9637badd9f. If there are still some rough edges here, let me know. Oh, interesting! I obviously missed that. > > I don't have total information, but I think they mostly use a single > > locale. If they have extremely specific needs, they are likely to end > > up with ICU, else they pick a glibc locale. I have no idea how likely > > that glibc locale is to match their environment. I wouldn't bet on it > > being the norm, but I wouldn't bet against whatever they have in the > > environment being more usable than "C". > > That's interesting. In other words, (in your sample) users aren't > worried about the precise sort order in their native language; it's > just that ASCII is particularly bad, and almost any "real" locale is > more appealing. Some users are definitely worried about the specific ordering. I didn't mean to imply the contrary. The most demanding customers want an ordering that is bug-compatible with a previous system they've used; many can settle for something that is reasonably similar to some other system they've used, or simply want something appropriate to their country and language. Some are less demanding and, yeah, just anything that is basically reasonable is OK. > Range scans using a natural language collation are dubious. It can't be > for a prefix search; LIKE 'myprefix%' needs the index to be defined > with text_pattern_ops (which is code point order), so the default isn't > going to work for them anyway. > > (A prefix search can't be implemented with a range scan in natural > language collation because, e.g. in the cs_CZ locale, 'cha' does not > fall between 'ch' and 'ci'.) > > So how often is a range scan using a natural language collation > actually useful? I'm sure there are some real cases, but I'd say it's > usually a mistake and they are quite possibly getting wrong results. That seems really pessimistic to me. Many people don't test their code as thoroughly as they should, but a lot of people test it at least somewhat. > I have trouble understanding this perspective: slow all indexes down > (and sorts, too), and risk index inconsistencies just in case someone > ends up doing a range scan on one of the indexes? How is that safer? I mean ... you seem to be imagining that people care about sort and index lookup speed drastically more than what I have experienced. People mostly build indexes if they dump and restore, which most people do very rarely, bordering on never. Or if they reindex occasionally, which is more common, but still not an everyday occurrence for the overwhelming majority of users. Index lookups are common, but the difference between a single-entry lookup on a collation-C index and a single-entry lookup on a glibc-collation index isn't enough to worry about in any case I've ever encountered. I'm sure it's possible to construct cases, but are they really that common? Most queries do a lot more other work than the time they spend doing btree comparisons, at least IME. On the other hand, the possibility of having to use a sequential scan to find entries between X and Y is a potentially huge blow. I would never choose to use collate "C" if I thought there was any chance I would want a range scan. Surely it isn't worth the risk. If I know the data is something where collation is never going to matter (like PostgreSQL log lines ingested into a table) then yes, collate "C" makes a lot of sense. > * the use case must be real (not relying on faulty assumptions about > lexicographical ordering) > * the input data must be large enough to benefit from an index scan > * one of the following must be true: > - the index needs to be correlated with the heap order (seems > unlikely; correlation usually happens with sequences, timestamps, etc., > not natural language text values); or > - needs to be eligible for an index only scan (plausible); or > - the amount of data read must be small enough that correlation > with the heap doesn't matter > * the result data needs to be small enough for a human to consume it > (otherwise why bother with natural language?) > * the performance improvement must be enough to offset the penalty > for equality searches and index maintenance > > While each of those is plausible, when combined, I think it's far from > the typical case. I don't. I mean, let's suppose I've got a web page that displays a report with peoples names, street addresses, cities, states, countries, and telephone numbers. I can click on a column header to sort by that column. For which of those columns do I want a natural language sort? I would argue probably most of them. I might know that my state and country name are all ASCII, and then it doesn't matter. But surely people's names, for example, or their cities, could contain non-ASCII characters? And my report might easily be paginated, to avoid loading too much data in the web browser at once. If displaying the first page I want something like LIMIT 100, and I don't want to have to compute the entire result set to get those first 100 rows. Back when I wrote web applications, before starting at EDB, this is the kind of thing that I did all the time, for like ten years straight. I had plenty of text fields that could have used collate "C", because they contained things like part numbers or account numbers or whatever. But anything that contained a person's name or a company name or any other kind of name that is assigned by humans rather than generated by a computer could contain any of the characters that humans use, and should be sorted the way humans like. And isn't this a totally normal kind of application for somebody to write? It sure was for me. --=20 Robert Haas EDB: http://www.enterprisedb.com