MIME-Version: 1.0
References: <e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com>
 <47e1b4f72fe732c5ae85c6cf2c085b4e99a10120.camel@j-davis.com>
 <CA+TgmobnOn9ipEUcthLmxvv0ZKBWyCcc048qqgqy1X+tszw_Cg@mail.gmail.com>
 <4309879ac305b1cf6b4d7b5fb85bc7b62c6ab768.camel@j-davis.com>
 <CA+TgmoYPfiz6_QjkiKykwziFDZFzPVf_x77rChYBu=OGVOmbYw@mail.gmail.com>
 <be03b4f8e5ed147a3e733d9a73ddeb51e6b83ad4.camel@j-davis.com>
 <CA+Tgmob7WJ3iETZea=P_eQ2-74q6xE_Vqnz074M5w5VD=dKURA@mail.gmail.com>
 <e364efcfa40af47aca9071ea81b38ce5573556fe.camel@j-davis.com>
In-Reply-To: <e364efcfa40af47aca9071ea81b38ce5573556fe.camel@j-davis.com>
From: Robert Haas <robertmhaas@gmail.com>
Date: Thu, 12 Mar 2026 15:58:17 -0400
Message-ID: 
 <CA+TgmoYGb1Zd+8cAqoW+SvHstgfQs_6xMUg_ueVYdpz4hpaM-A@mail.gmail.com>
Subject: Re: Change initdb default to the builtin collation provider
To: Jeff Davis <pgsql@j-davis.com>
Cc: pgsql-hackers@postgresql.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BTgmoYGb1Zd%2B8cAqoW%2BSvHstgfQs_6xMUg_ueVYdpz4hpaM-A%40mail.gmail.com>
Precedence: bulk

On Thu, Mar 12, 2026 at 3:20=E2=80=AFPM Jeff Davis <pgsql@j-davis.com> wrot=
e:
> pg_upgrade should copy the locale settings to the new cluster as of
> 9637badd9f. If there are still some rough edges here, let me know.

Oh, interesting! I obviously missed that.

> > I don't have total information, but I think they mostly use a single
> > locale. If they have extremely specific needs, they are likely to end
> > up with ICU, else they pick a glibc locale. I have no idea how likely
> > that glibc locale is to match their environment. I wouldn't bet on it
> > being the norm, but I wouldn't bet against whatever they have in the
> > environment being more usable than "C".
>
> That's interesting. In other words, (in your sample) users aren't
> worried about the precise sort order in their native language; it's
> just that ASCII is particularly bad, and almost any "real" locale is
> more appealing.

Some users are definitely worried about the specific ordering. I
didn't mean to imply the contrary. The most demanding customers want
an ordering that is bug-compatible with a previous system they've
used; many can settle for something that is reasonably similar to some
other system they've used, or simply want something appropriate to
their country and language. Some are less demanding and, yeah, just
anything that is basically reasonable is OK.

> Range scans using a natural language collation are dubious. It can't be
> for a prefix search; LIKE 'myprefix%' needs the index to be defined
> with text_pattern_ops (which is code point order), so the default isn't
> going to work for them anyway.
>
> (A prefix search can't be implemented with a range scan in natural
> language collation because, e.g. in the cs_CZ locale, 'cha' does not
> fall between 'ch' and 'ci'.)
>
> So how often is a range scan using a natural language collation
> actually useful? I'm sure there are some real cases, but I'd say it's
> usually a mistake and they are quite possibly getting wrong results.

That seems really pessimistic to me. Many people don't test their code
as thoroughly as they should, but a lot of people test it at least
somewhat.

> I have trouble understanding this perspective: slow all indexes down
> (and sorts, too), and risk index inconsistencies just in case someone
> ends up doing a range scan on one of the indexes? How is that safer?

I mean ... you seem to be imagining that people care about sort and
index lookup speed drastically more than what I have experienced.
People mostly build indexes if they dump and restore, which most
people do very rarely, bordering on never. Or if they reindex
occasionally, which is more common, but still not an everyday
occurrence for the overwhelming majority of users. Index lookups are
common, but the difference between a single-entry lookup on a
collation-C index and a single-entry lookup on a glibc-collation index
isn't enough to worry about in any case I've ever encountered. I'm
sure it's possible to construct cases, but are they really that
common? Most queries do a lot more other work than the time they spend
doing btree comparisons, at least IME. On the other hand, the
possibility of having to use a sequential scan to find entries between
X and Y is a potentially huge blow. I would never choose to use
collate "C" if I thought there was any chance I would want a range
scan. Surely it isn't worth the risk. If I know the data is something
where collation is never going to matter (like PostgreSQL log lines
ingested into a table) then yes, collate "C" makes a lot of sense.

>   * the use case must be real (not relying on faulty assumptions about
> lexicographical ordering)
>   * the input data must be large enough to benefit from an index scan
>   * one of the following must be true:
>     - the index needs to be correlated with the heap order (seems
> unlikely; correlation usually happens with sequences, timestamps, etc.,
> not natural language text values); or
>     - needs to be eligible for an index only scan (plausible); or
>     - the amount of data read must be small enough that correlation
> with the heap doesn't matter
>   * the result data needs to be small enough for a human to consume it
> (otherwise why bother with natural language?)
>   * the performance improvement must be enough to offset the penalty
> for equality searches and index maintenance
>
> While each of those is plausible, when combined, I think it's far from
> the typical case.

I don't. I mean, let's suppose I've got a web page that displays a
report with peoples names, street addresses, cities, states,
countries, and telephone numbers. I can click on a column header to
sort by that column. For which of those columns do I want a natural
language sort? I would argue probably most of them. I might know that
my state and country name are all ASCII, and then it doesn't matter.
But surely people's names, for example, or their cities, could contain
non-ASCII characters? And my report might easily be paginated, to
avoid loading too much data in the web browser at once. If displaying
the first page I want something like LIMIT 100, and I don't want to
have to compute the entire result set to get those first 100 rows.

Back when I wrote web applications, before starting at EDB, this is
the kind of thing that I did all the time, for like ten years
straight. I had plenty of text fields that could have used collate
"C", because they contained things like part numbers or account
numbers or whatever. But anything that contained a person's name or a
company name or any other kind of name that is assigned by humans
rather than generated by a computer could contain any of the
characters that humans use, and should be sorted the way humans like.
And isn't this a totally normal kind of application for somebody to
write? It sure was for me.

--=20
Robert Haas
EDB: http://www.enterprisedb.com