From: Thomas Munro
Date: Mon, 16 Feb 2026 17:35:41 +1300
Subject: Re: Questionable description about character sets
To: Tatsuo Ishii
Cc: andreas@proxel.se, pgsql-hackers@lists.postgresql.org

On Sat, Feb 14, 2026 at 11:20 PM Tatsuo Ishii wrote:
> > Wouldn't that make the table very wide?
>
> I don't think it would make the table very wide but a little bit
> wider. So I think adding the character sets information to
> "Description" column is better. Some of encodings already have the
> info. See attached patch.

When I point my browser at
file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
I see these longer descriptions flowing onto multiple lines making the
table cells higher, while the published documentation[1] does only a
small amount of that, and then the font instead becomes smaller as I
make the window narrower.  Is there an easy way to see the final
website form in a local build?

We'd have more free space in the affected rows if we did s/Extended
UNIX Code-JP/EUC-JP/.  Why is that acronym expanded, while ISO, ECMA,
JIS and CP are not?

It might be confusing that the style "ISO 8859-1, ECMA 94" is used to
list alternative encoding standards that are aligned or equivalent,
while here you're listing the encoding and then the underlying
character sets in the same way.  Would it be better to put them in
parentheses?

With those two changes we'd have:

EUC_JP       | EUC-JP (JIS X 0201, JIS X 0208, JIS X 0212)
EUC_JIS_2004 | EUC-JP (JIS X 0201, JIS X 0213)

If we really wanted to save horizontal space, I suppose we could drop
the Alias column and either list aliases in a new table, or give them
their own rows with a description "Alias for ...", but that seems a
bit over the top.

While wondering if some other rows could be more specific, I noticed
that for GBK we have "Extended National Standard".  I don't understand
these things, but from a quick look at Wikipedia[2], I got the idea
that if convert_to('€', 'GBK') = '\x80'::bytea (yes) then what we have
might actually be the yet-further-extended standard known as "GBK
1.0".  Do I have that right?

As for BIG5, it seems to be an underspecified mess defying description
other than "good luck" :-)

Thankfully we won't have to list all the standards that MULE_INTERNAL
indirectly covers, as it looks like we've agreed to drop it.  And IIRC
there was a thread somewhere proposing to drop JOHAB...

> > And for e.g.
> > European character encodings I am not sure it is that useful since
> > most or maybe even all of them are subsets of unicode, it mostly
> > gets interesting for encodings which support characters not in
> > unicode, right?
>
> Choosing UTF8 or not is just one of the use cases.
>
> I am thinking about the use case in which user wants to continue to
> use other encodings (e.g. wants to avoid conversion to UTF8).
> Example: suppose the user has a legacy system in which EUC_JP is
> used. The data in the system includes JIS X 0201, JIS X 0208 and JIS X
> 0212, and he wants to make sure that PostgreSQL supports all those
> character sets in EUC_JP, because some tools does not support JIS X
> 0212. Only JIS X 0201 and JIS X 0208 are supported. Currently the info
> (whether JIS X 0212 is supported or not) does not exist anywhere in
> our docs. It's only in the source code. I think it's better to have
> the info in our docs so that user does not need to look into the
> source code.

Makes sense to me.  The underlying character sets must be very
important to understand, especially if implementations vary on these
points.  We should give the information.

. o O ( I wonder if anyone has ever tried to make an "XTF-8-JA"
encoding just like UTF-8 but with ~1900 high-frequency Japanese
codepoints swapped into the 2-byte range U+0080-07ff where Greek,
Hebrew, Arabic and others won the encoding lottery.  UTF-16 is
apparently sometimes preferred to save space in other RDBMSs that can
do it, but I suppose you could achieve the same size most of the time
with a scheme like that.  The other encodings have the desired size,
but non-universal character sets.  A similar thought for the languages
of India, but with the frequency fuzziness factor removed: you could
surely map a dozen tiny non-ideographic scripts into that range to
save a byte per character...  Hindi, Tamil etc didn't get a very good
deal with UTF-8.
Don't worry, I'm not suggesting that PostgreSQL has any business
inventing its own harebrained encodings, I'm just wondering out loud
if that is a kind of thing that exists somewhere out there... )

[1] https://www.postgresql.org/docs/current/multibyte.html
[2] https://en.wikipedia.org/wiki/GBK_(character_encoding)
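
For anyone who wants to check the size arithmetic behind the thought bubble above, here is a rough sketch in Python rather than anything PostgreSQL-specific. The sample characters and the helper name encoded_sizes are picked purely for illustration, and Python's "euc_jp" codec is assumed to be a reasonable stand-in for the encoding; it is not guaranteed to match PostgreSQL's EUC_JP conversion byte for byte.

```python
# Sketch only: per-character byte cost in UTF-8, UTF-16 and EUC-JP,
# using Python's built-in codecs (an assumption, not PostgreSQL's own
# conversion routines).

# Hypothetical sample set, one character per script mentioned above.
SAMPLES = {
    "Greek alpha": "\u03b1",    # inside the 2-byte UTF-8 range
    "Hiragana a": "\u3042",     # outside it, so 3 bytes in UTF-8
    "Devanagari ha": "\u0939",  # Hindi, also 3 bytes in UTF-8
    "Tamil a": "\u0b85",        # also 3 bytes in UTF-8
}

# Code points that UTF-8 encodes in exactly two bytes: U+0080..U+07FF.
TWO_BYTE_CODEPOINTS = 0x07FF - 0x0080 + 1


def encoded_sizes(ch):
    """Return bytes needed for ch per encoding; None if not encodable."""
    sizes = {
        "utf-8": len(ch.encode("utf-8")),
        # utf-16-le avoids counting the 2-byte byte order mark.
        "utf-16": len(ch.encode("utf-16-le")),
    }
    try:
        sizes["euc_jp"] = len(ch.encode("euc_jp"))
    except UnicodeEncodeError:
        # The character is not in any character set EUC-JP carries.
        sizes["euc_jp"] = None
    return sizes


if __name__ == "__main__":
    print("2-byte UTF-8 code points:", TWO_BYTE_CODEPOINTS)
    for name, ch in SAMPLES.items():
        print(name, encoded_sizes(ch))
```

Running it shows the trade-off described above: the hiragana sample costs three bytes in UTF-8 but two in both UTF-16 and EUC-JP, while the Devanagari and Tamil samples cost three bytes in UTF-8 and cannot be encoded in EUC-JP at all.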