MIME-Version: 1.0
References: <20260211.185847.1679085676298121526.ishii@postgresql.org>
 <29fd7c6b-b3cd-4d45-977c-d9ef2f88378a@proxel.se> <20260214.192033.705419152780150580.ishii@postgresql.org>
In-Reply-To: <20260214.192033.705419152780150580.ishii@postgresql.org>
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 16 Feb 2026 17:35:41 +1300
Message-ID: <CA+hUKG+HkG-EnYnR_hQzhDCTtdx0Cj-_X-jAzvNkF_=V39jQng@mail.gmail.com>
Subject: Re: Questionable description about character sets
To: Tatsuo Ishii <ishii@postgresql.org>
Cc: andreas@proxel.se, pgsql-hackers@lists.postgresql.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CA%2BhUKG%2BHkG-EnYnR_hQzhDCTtdx0Cj-_X-jAzvNkF_%3DV39jQng%40mail.gmail.com>
Precedence: bulk

On Sat, Feb 14, 2026 at 11:20 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> > Wouldn't that make the table very wide?
>
> I don't think it would make the table very wide but a little bit
> wider. So I think adding the character sets information to
> "Description" column is better. Some of encodings already have the
> info. See attached patch.

When I point my browser at
file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
I see these longer descriptions flowing onto multiple lines making the
table cells higher, while the published documentation[1] does only a
small amount of that, and then the font instead becomes smaller as I
make the window narrower.  Is there an easy way to see the final
website form in a local build?

We'd have more free space in the affected rows if we did s/Extended
UNIX Code-JP/EUC-JP/.  Why is that acronym expanded, while ISO, ECMA,
JIS and CP are not?

It might be confusing that the style "ISO 8859-1, ECMA 94" is used to
list alternative encoding standards that are aligned or equivalent,
while here you're listing the encoding and then the underlying
character sets in the same way.  Would it be better to put them in
parentheses?

With those two changes we'd have:

EUC_JP       | EUC-JP (JIS X 0201, JIS X 0208, JIS X 0212)
EUC_JIS_2004 | EUC-JP (JIS X 0201, JIS X 0213)

If we really wanted to save horizontal space, I suppose we could drop
the Alias column and either list aliases in a new table, or give them
their own rows with a description "Alias for ...", but that seems a
bit over the top.

While wondering if some other rows could be more specific, I noticed
that for GBK we have "Extended National Standard".  I don't understand
these things, but from a quick look at Wikipedia[2], I got the idea
that if convert_to('€', 'GBK') = '\x80'::bytea (yes) then what we have
might actually be the yet-further-extended standard known as "GBK
1.0".  Do I have that right?

As for BIG5, it seems to be an underspecified mess defying description
other than "good luck" :-)  Thankfully we won't have to list all the
standards that MULE_INTERNAL indirectly covers, as it looks like we've
agreed to drop it.  And IIRC there was a thread somewhere proposing to
drop JOHAB...

> > And for e.g. European
> > character encodings I am not sure it is that useful since most or
> > maybe even all of them are subsets of unicode, it mostly gets
> > interesting for encodings which support characters not in unicode,
> > right?
>
> Choosing UTF8 or not is just one of the use cases.
>
> I am thinking about the use case in which user wants to continue to
> use other encodings (e.g. wants to avoid conversion to UTF8).
> Example: suppose the user has a legacy system in which EUC_JP is
> used. The data in the system includes JIS X 0201, JIS X 0208 and JIS X
> 0212, and he wants to make sure that PostgreSQL supports all those
> character sets in EUC_JP, because some tools do not support JIS X
> 0212. Only JIS X 0201 and JIS X 0208 are supported. Currently the info
> (whether JIS X 0212 is supported or not) does not exist anywhere in
> our docs. It's only in the source code. I think it's better to have
> the info in our docs so that user does not need to look into the
> source code.

Makes sense to me.  The underlying character sets must be very
important to understand, especially if implementations vary on these
points.  We should give the information.
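
To make the "which character sets are inside EUC_JP" point concrete,
here is a minimal Python sketch that labels each character of an EUC-JP
byte string by its lead byte.  It follows the general EUC-JP code set
layout (plain bytes, the 0x8E and 0x8F single-shift prefixes, and
two-byte 0xA1-0xFE pairs), not PostgreSQL's source code.  The function
name euc_jp_charsets is made up for illustration, and trailing bytes
are not validated.

```python
from typing import List


def euc_jp_charsets(data: bytes) -> List[str]:
    """Label each character in an EUC-JP byte string by its character set.

    Illustrative only: classification is by lead byte, and trailing
    bytes are not range-checked.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Code set 0: one byte.
            name, width = "ASCII or JIS X 0201 Roman", 1
        elif lead == 0x8E:
            # Code set 2: SS2 prefix + one byte of half-width katakana.
            name, width = "JIS X 0201 kana", 2
        elif lead == 0x8F:
            # Code set 3: SS3 prefix + two bytes.
            name, width = "JIS X 0212", 3
        elif 0xA1 <= lead <= 0xFE:
            # Code set 1: two bytes.
            name, width = "JIS X 0208", 2
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError(f"truncated character at offset {i}")
        result.append(name)
        i += width
    return result
```

A tool that rejects every sequence starting with 0x8F is one that lacks
JIS X 0212 support in this sense, which is the kind of difference the
Description column would let a user check without reading the source.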

. o O ( I wonder if anyone has ever tried to make an "XTF-8-JA"
encoding just like UTF-8 but with ~1900 high-frequency Japanese
codepoints swapped into the 2-byte range U+0080-07ff where Greek,
Hebrew, Arabic and others won the encoding lottery.  UTF-16 is
apparently sometimes preferred to save space in other RDBMSs that can
do it, but I suppose you could achieve the same size most of the time
with a scheme like that.  The other encodings have the desired size,
but non-universal character sets.  A similar thought for the languages
of India, but with the frequency fuzziness factor removed: you could
surely map a dozen tiny non-ideographic scripts into that range to
save a byte per character... Hindi, Tamil etc didn't get a very good
deal with UTF-8.  Don't worry, I'm not suggesting that PostgreSQL has
any business inventing its own hare-brained encodings, I'm just
wondering out loud if that is a kind of thing that exists somewhere
out there... )
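
For anyone who wants to check the arithmetic behind that musing, here
is a small Python sketch.  It counts the code points in UTF-8's 2-byte
range and compares per-character sizes in UTF-8 and UTF-16 for a few
sample characters.  The sample characters and the helper name
encoded_sizes are my own picks for illustration, nothing more.

```python
# Number of code points that UTF-8 encodes in exactly two bytes.
TWO_BYTE_SLOTS = 0x07FF - 0x0080 + 1


def encoded_sizes(ch: str) -> dict:
    """Return the encoded size in bytes of one character.

    utf-16-le is used so that no byte order mark is counted.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
    }


# Illustrative samples: scripts inside and outside the 2-byte range.
SAMPLES = {
    "Greek alpha": "\u03b1",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "Devanagari a": "\u0905",
    "Tamil a": "\u0b85",
    "Hiragana a": "\u3042",
    "Kanji 'day'": "\u65e5",
}

if __name__ == "__main__":
    print("2-byte slots:", TWO_BYTE_SLOTS)
    for label, ch in SAMPLES.items():
        print(label, encoded_sizes(ch))
```

The Greek, Hebrew and Arabic samples take two bytes in UTF-8, while the
Devanagari, Tamil and Japanese samples take three in UTF-8 and two in
UTF-16, which is the gap such a remapped scheme would be trying to
close.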

[1] https://www.postgresql.org/docs/current/multibyte.html
[2] https://en.wikipedia.org/wiki/GBK_(character_encoding)