public inbox for [email protected]  
From: Thomas Munro <[email protected]>
To: Nico Williams <[email protected]>
Cc: Tatsuo Ishii <[email protected]>
Cc: [email protected]
Cc: [email protected]
Subject: Re: Questionable description about character sets
Date: Tue, 17 Feb 2026 15:38:05 +1300
Message-ID: <CA+hUKGKYawRXAhsJzwQ_JAPnFhM7uufMYbpG2RizLUmzuRR8tw@mail.gmail.com>
In-Reply-To: <aZKmF2C9SAzOgAP9@ubby>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<CA+hUKG+HkG-EnYnR_hQzhDCTtdx0Cj-_X-jAzvNkF_=V39jQng@mail.gmail.com>
	<aZKmF2C9SAzOgAP9@ubby>

On Mon, Feb 16, 2026 at 6:07 PM Nico Williams <[email protected]> wrote:
> On Mon, Feb 16, 2026 at 05:35:41PM +1300, Thomas Munro wrote:
> >                                              [...].  UTF-16 is
> > apparently sometimes preferred to save space in other RDBMSs that can
> > do it, but I suppose you could achieve the same size most of the time
> > with a scheme like that.  [...]
>
> [Off-topic] I think UTF-16 yielding smaller encodings is a truism.  It
> really depends on what language the text is mostly written in, but
> mostly it's a truism that's not true.  Anyways, UTF-16 has to go away,
> and the sooner the better.

But when it's true for your language, and that's what your database
holds, then it's true all the time.  And these aren't just outliers:
we're talking about nearly all of Asia's languages.  That's ... a lot
of NAND gates being wasted due to arbitrary choices made probably
before UTF-8 even existed.
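
To put rough numbers on the size argument, here is a minimal Python
sketch that encodes a few short sample strings and counts bytes.  The
sample words, the SAMPLES/ENCODINGS names and the encoded_sizes() helper
are made up for illustration, and three words are obviously not a
measurement of real database contents.

```python
# Compare the encoded size of short sample strings in three encodings.
# The samples are arbitrary illustrations, not representative data.
SAMPLES = {
    "English": "database",
    "Japanese": "データベース",
    "Chinese": "数据库",
}

# utf-16-le is used so that no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "gb18030")


def encoded_sizes(text):
    """Return a dict mapping each encoding name to the encoded length in bytes."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for name, text in SAMPLES.items():
    print(name, len(text), "chars:", encoded_sizes(text))
```

For the Chinese sample, the three characters take 9 bytes in UTF-8 and 6
bytes in both UTF-16 and GB18030, while the ASCII-only sample goes the
other way (8 bytes in UTF-8, 16 in UTF-16).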

I do agree with you that UTF-16 has turned out to be an odd beast,
though: not big enough, but also too big.  Maybe it's only just right
for CJK (or CJ?).  I don't see much chance at all of anyone
retrofitting UTF-16 into PostgreSQL anyway, so I wouldn't worry about
that.  I could more easily see us figuring out how to drop the
requirement for high bits in multi-byte sequence tails, so that GB18030
could be used to store two-byte Chinese (while also retaining full
access to all of Unicode, as it does).  I was basically wondering out
loud if Japan might be hiding something like that somewhere, and
imagining what it might look like.
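
To make the "high bits in multi-byte sequence tails" point concrete,
here is a small Python sketch that looks for GB18030 sequences whose
non-leading bytes fall below 0x80, using Python's built-in gb18030
codec.  The helper names are hypothetical and this only inspects
encoded bytes; it says nothing about how PostgreSQL itself would check
this.

```python
def tail_bytes_below_0x80(ch, encoding="gb18030"):
    """Return the non-leading bytes of ch's encoding that have the high bit clear."""
    encoded = ch.encode(encoding)
    return [b for b in encoded[1:] if b < 0x80]


def has_ascii_range_tail(ch, encoding="gb18030"):
    """True if any byte after the first one is in the ASCII range."""
    return bool(tail_bytes_below_0x80(ch, encoding))


# Scan the original BMP block of CJK Unified Ideographs (U+4E00..U+9FA5).
cjk = [chr(cp) for cp in range(0x4E00, 0x9FA6)]
two_byte = [ch for ch in cjk if len(ch.encode("gb18030")) == 2]
with_low_tail = [ch for ch in two_byte if has_ascii_range_tail(ch)]

print("ideographs scanned:      ", len(cjk))
print("encoded in two bytes:    ", len(two_byte))
print("with a tail byte < 0x80: ", len(with_low_tail))

# Four-byte sequences, used for the rest of Unicode, also carry
# ASCII-range bytes (digits) in their second and fourth positions.
print(has_ascii_range_tail("\U0001F600"))
```

Running the same helper with encoding="utf-8" never reports such a
tail byte, since every byte of a multi-byte UTF-8 sequence has the high
bit set.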
