From: Thomas Munro
Date: Wed, 11 Feb 2026 11:39:20 +1300
Subject: Do we still need MULE_INTERNAL?
To: PostgreSQL Hackers

Hi,

MULE_INTERNAL solved a really hard problem years ago and must have been extremely useful, but I think we might be able to drop it now, and I have a patch. If I am wrong about that and there are users who would object, then we should probably improve it instead, and I have some ideas (part of larger reworkings), but first I'd like to establish whether it is already completely obsolete.

This history may be very well known to hackers in Japan, but I had to start from zero with my archeologist hat on, and I suspect this is as obscure to many others as it was to me, so here's what I have come up with:

In the early nineties (perhaps beginning in the late 80s?), researchers at AIST developed the MULE "meta-encoding" for Nemacs (Nihon Emacs), later merged into XEmacs and GNU Emacs. Unlike early UTF-16-only versions of Unicode, Emacs' internal encoding was multi-byte and backward-compatible with ASCII and traditional in-memory and on-disk representations of text.
Aside from lacking a multi-byte encoding, early versions of Unicode apparently also failed to cover all CJK characters needed for information systems of the time.

It's a simple and clever idea, just messy in the details and a little inefficient: each byte was either ASCII or a lead byte that says which encoding follows (perhaps with light reencoding/escaping in some cases, IDK), so except for ASCII, it was always less efficient by at least one byte than whatever it wraps, but there was nothing it couldn't handle. It could mix around 41 encodings this way, so for the first time you could have (say) Chinese and Arabic in one document in a multi-byte format compatible with traditional conventions.

The idea doesn't seem to have been adopted by any other software except PostgreSQL (at least that I could find in quick searches; I'd be interested to hear of any others). That's probably because Unicode gained UTF-8 only a bit later in 1993, providing the missing multi-byte encoding. Instead of referencing 41 other moving standards, it was one unified standard with full international industry support, and it fitted neatly into C strings and existing text file conventions (not to mention other design goals like self-synchronisation). The rest is history.

Our implementation of MULE_INTERNAL only supports a few sub-encodings, for Latin, Cyrillic, Chinese, Japanese and Korean, and hasn't been updated to support modern versions of the CJK ones (i.e. when we got EUC_JIS_2004, we didn't handle the corresponding MULE_INTERNAL lead byte, and I haven't checked the Chinese or Korean situation), which I suspect might be an actionable clue that it is not in use... but I lack the context to say, so that's only a hypothesis. Our code references the XEmacs project's internals documentation, last published in 1997, with a note added in 2012 that we'd started following GNU's implementation instead, which I think means that mule-conf.el[1] is the closest thing to a standard.
We added some more IDs as they were assigned, but they remain unimplemented. (If we actually do need to keep this, perhaps our implementation could dispatch to our "direct" encoding routines instead of open-coding the sub-encodings? That might be hopelessly naive, and I can see the combination problem that we have and they don't (since Emacs 23 they only convert to/from Unicode). It's just a thought, but I think something like that would be closer to what Emacs is doing, IIUC.)

Modern GNU Emacs switched to using UTF-8 internally[2] as of Emacs 23 (2009). It can still convert what it calls "Emacs 21 internal format" when loading a file, but I suspect we might be the last ones to support the idea directly as an internal representation.

Emacs' internal representation (both old and new) is technically a superset of Unicode, as they are proud to say, but AFAICT that just means you're free to map your made-up script's made-up encoding into the 5-byte UTF-8 sequence space not used by Unicode (or in the old system, using private lead bytes), not anything actually useful for our purposes. And if you just want to put your Klingon or Tolkien elvish homework into PostgreSQL, see the ConScript Unicode Registry, it'd use less disk space!

More seriously, I think there have been periods when e.g. JIS rolled out a new standard with characters that Unicode didn't have yet. Unicode simply added them to a minor release (e.g. 3.2), but for a short time you could have said that Unicode was not a superset or theoretically sufficient. On the other hand, PostgreSQL wouldn't stop you using such hypothetical characters anyway: our UTF-8 validation is for well-formedness, not definedness. There may of course be implications for sorting and classifying, but all of that seems a bit bogus: we stopped updating MULE_INTERNAL even for Japanese, we routinely upgrade Unicode, and locales never worked for MULE_INTERNAL anyway.
I also doubt very much that Unicode would be out of the loop on new character assignments in modern times.

As for interchange and system boundaries, (1) standard locales on real systems don't come in MULE_INTERNAL encodings so none of that stuff works, (2) the JDBC driver and presumably any driver/language that has its own firm ideas about strings can't support it either, (3) even applications using libpq would be hard pressed to know what text actually means outside ASCII, if they choose it as a client encoding, except perhaps Emacs if you're lucky.

The motivation for removing it would be the unnecessary security risk and the maintenance burden for future development in our encoding and locale support. The motivation for keeping it would be that there are users with important data trapped in it.

In the absence of hard data, I tried to imagine why you'd want to use it, other than perhaps just "we needed it in 199x and haven't migrated yet". I don't know too much about CJK computing but I am aware of the space issue: commonly used CJK characters take 3 UTF-8 bytes to represent, one more than the national EUC_* encodings. That's a motivation for preferring EUC_*, but let's see how MULE_INTERNAL compares (bytes per character):

                                        kanji  kana
  MULE_INTERNAL-wrapped-JISX0208/0212:      3     3
  MULE_INTERNAL-wrapped-JISX0201K:        N/A     2
  UTF8:                                     3     3
  EUC_JP:                                   2     2
  EUC_JIS_2004:                             2     2

Since there are two encodings for kana characters and MULE's superpower is to switch, I guess it depends how you chose to encode it and what your ratio of kana to kanji is. Google gives me a first guess of 50/50. I see that the sjis2mic() conversion is clever enough to use JISX0201K for kana, so if your client is speaking SJIS then I suppose you might actually finish up with around 2.5 bytes per character. That's smaller than UTF-8, and larger than EUC_*. On the other hand, EUC_JIS_2004 handles more Japanese characters, and UTF-8 handles all of the world's scripts.
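To make that arithmetic concrete, here is a rough sketch in Python. It assumes the per-character byte counts from the table for MULE_INTERNAL (Python has no MULE_INTERNAL codec, so those two constants are not measured), and it checks the UTF-8 and EUC figures against Python's own codecs. The function names and the sample string are made up for illustration.

```python
# MULE_INTERNAL costs are ASSUMED from the table above, not measured:
# Python ships no MULE_INTERNAL codec.
MULE_KANJI_BYTES = 3  # lead byte + 2-byte JIS X 0208 code
MULE_KANA_BYTES = 2   # lead byte + 1-byte JIS X 0201 kana code


def mule_average(kana_ratio: float) -> float:
    """Average bytes per character for a given share of kana (0.0 to 1.0)."""
    return kana_ratio * MULE_KANA_BYTES + (1.0 - kana_ratio) * MULE_KANJI_BYTES


def codec_average(text: str, codec: str) -> float:
    """Measured bytes per character using one of Python's real codecs."""
    return len(text.encode(codec)) / len(text)


sample = "漢字かな"  # hypothetical sample: two kanji, two kana

print("MULE_INTERNAL (assumed costs):", mule_average(0.5))
for codec in ("utf-8", "euc_jp", "euc_jis_2004"):
    print(codec, codec_average(sample, codec))
```

With a 50/50 mix the assumed MULE_INTERNAL average comes out at 2.5, against 3.0 for UTF-8 and 2.0 for the EUC encodings. The MULE_INTERNAL number only holds if the kana really are stored as JISX0201K; kana wrapped as JISX0208 would cost 3 bytes each, as in the first row of the table.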
So *maybe* there is a small motivation there, depending on what you think about JIS 2004. I somehow doubt the trade-off makes sense in practice though, you'd be forever dealing with weird problems when some guy called, to pick an example character I googled that is common but missing in the older standard, "凜" needs to appear in your data, if I understood all of that correctly.

For Chinese, the calculus is simpler as they only use hànzì (~= kanji), nothing potentially smaller like kana to affect the average. For Korean, I have no clue.

Can any Japanese (or other) experts offer any clues? Concrete questions:

* Is anyone actually using MULE_INTERNAL today?
* If so, what prevented migration?
* Was it ever actually used outside Japan?
* Is the lack of interest in the new (22 year old) JIS standard in MULE_INTERNAL meaningful?

[1] https://github.com/emacs-mirror/emacs/blob/master/lisp/international/mule-conf.el
[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html