Re: Questionable description about character sets


From: Tatsuo Ishii <[email protected]>
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Subject: Re: Questionable description about character sets
Date: Fri, 17 Apr 2026 10:28:24 +0900 (JST)
Message-ID: <[email protected]>
In-Reply-To: <CA+hUKGJLCs7+8sW8ufY8WmiZzRhK+wtMEpe1-tJ6oyy2YEAQQg@mail.gmail.com>
References: <[email protected]>
	<CA+hUKG+HkG-EnYnR_hQzhDCTtdx0Cj-_X-jAzvNkF_=V39jQng@mail.gmail.com>
	<CA+hUKGJLCs7+8sW8ufY8WmiZzRhK+wtMEpe1-tJ6oyy2YEAQQg@mail.gmail.com>

> If we wanted to follow the SQL standard's terminology, I think we'd
> call this the "character repertoire".

Calling it "character repertoire" works for me. Fortunately, the
meaning of "character repertoire" in the SQL standard appears to be
the same as in other standards (ISO/IEC 2022 and ISO/IEC 10646).

> In the standard, a "character
> set" is the database object representing a repertoire and an encoding
> of it, or its identifier.

Yes. Unlike ISO/IEC 2022 or 10646, the SQL standard has no clear
distinction between character set (in the sense of ISO/IEC 10646) and
encoding. (To me this is quite confusing.)

> But if we put it in the description column,
> we wouldn't have to name it.

Why?

> Researching the standard led me to
> src/backend/catalog/information_schema.sql[1].  It currently reports
> the encoding name as the character set and the repertoire, except
> s/UTF8/UCS/ for the repertoire.  That's the same information as you
> want to document here.  For the character set (in the SQL standard
> sense), the current view definition seems reasonable given that we
> don't support CREATE CHARACTER SET or CHARACTER SET generally,

Why? For example, shouldn't EUC_JP have JIS X 0201, JIS X 0208, and
JIS X 0212 as its character repertoire?
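
As a rough illustration of why one encoding can draw on several coded
character sets, here is a minimal sketch using Python's "euc_jp"
codec (an assumption: it stands in for the server's EUC_JP, and the
two may differ in edge cases). The helper name eucjp_source_set is
hypothetical; it only inspects the lead byte of the encoded form.

```python
def eucjp_source_set(ch):
    """Guess which coded character set one EUC-JP character comes from.

    Based on the EUC-JP code set layout: a lead byte below 0x80 is the
    single-byte set, SS2 (0x8E) selects half-width katakana, SS3 (0x8F)
    selects JIS X 0212, and any other lead byte is JIS X 0208.
    """
    lead = ch.encode("euc_jp")[0]
    if lead < 0x80:
        return "ASCII / JIS X 0201 Roman"
    if lead == 0x8E:
        return "JIS X 0201 Katakana"
    if lead == 0x8F:
        return "JIS X 0212"
    return "JIS X 0208"


# Sample characters chosen for illustration only.
for ch in ["A", "\uff71", "\u3042", "\u4e02"]:
    print(f"U+{ord(ch):04X}", ch.encode("euc_jp").hex(), eucjp_source_set(ch))
```

If the repertoire column were documented per encoding, EUC_JP would
plausibly list all of the sets this sketch distinguishes.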

> and for
> the character repertoire, the s/UTF8/UCS/ translation makes sense, but
> you chose to call it "Unicode".  Shouldn't those agree?

I think "UCS" is not a repertoire but a coded character set.
"Unicode" or "Unicode repertoire" [1] seems more appropriate.

[1] https://www.unicode.org/reports/tr17/tr17-3.html

> If GB18030 were a valid server encoding, it would surely have to
> report UCS, like UTF8, since it is also a "Unicode transformation
> format"[2] (its purpose is to be backwards compatible with legacy
> 2-byte-per-common-Chinese-character formats while also covering all of
> Unicode 100% systematically, ie booting stuff they don't often encode
> into the 3- and 4-byte zone to make room for efficient encoding of
> stuff they do often encode).  So I think that means your new
> documentation should say UCS (or UNICODE) for that one too.

Not sure. I have heard that the latest GB18030 (GB18030-2022, at this
point) does not contain some newer Unicode characters.
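
For reference, the variable-length behavior described in the quoted
text can be observed with a minimal sketch using Python's "gb18030"
codec. This is an assumption-laden check: the codec may not track
GB18030-2022, so it says nothing about which newer characters that
edition covers. The helper name gb18030_profile is hypothetical.

```python
def gb18030_profile(ch):
    """Return (encoded length in bytes, whether the round trip is lossless)."""
    encoded = ch.encode("gb18030")
    return len(encoded), encoded.decode("gb18030") == ch


# Sample characters chosen for illustration only.
samples = {
    "ASCII letter": "A",
    "common Chinese character": "\u4e2d",
    "Thai letter (BMP, outside the 2-byte zone)": "\u0e01",
    "emoji (supplementary plane)": "\U0001f600",
}

for label, ch in samples.items():
    length, lossless = gb18030_profile(ch)
    print(f"{label}: {length} byte(s), round trip ok: {lossless}")
```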

> I don't
> know how other encodings should spell their repertoire though...

I need to research that too.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
