Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wIFTE-007ukc-1n for pgsql-hackers@arkaria.postgresql.org; Thu, 30 Apr 2026 00:41:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wIFTC-005CTs-0l for pgsql-hackers@arkaria.postgresql.org; Thu, 30 Apr 2026 00:41:34 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wIFTB-005CTj-2b for pgsql-hackers@lists.postgresql.org; Thu, 30 Apr 2026 00:41:33 +0000 Received: from mail-dl1-x1231.google.com ([2607:f8b0:4864:20::1231]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wIFT7-00000003QQK-2cSw for pgsql-hackers@lists.postgresql.org; Thu, 30 Apr 2026 00:41:32 +0000 Received: by mail-dl1-x1231.google.com with SMTP id a92af1059eb24-12dc1c0b724so44057c88.1 for ; Wed, 29 Apr 2026 17:41:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1777509689; cv=none; d=google.com; s=arc-20240605; b=dyIvXzYe1g8zgoATmiad0WMqx9E1IchxgSUeudzFiWFC41DGUdPTrk7AOcat7exU4M Ne8VbaDZ68z9kgrEKwoG5Y7E7h098qJwqI7/NgLrvHNEd1F+4ek0i9LOU/0Rd2bkyyII p0XWF82haGL0bjYUh52PlbqyUvY86OE0a61iSp4+ztIBbs6WvRBRlv2kE31M9XMgsDxM OxZpgux3115utXCWKZPm0l4zm2KH0rx7sO60Jk/qTNcOPXAbb8dH2oT68tJAuSUWWAiM H1IJiTFidRDJKSm4hUe1h//T4L7345Hesjd4vMDkP5EdvN/B4BgMum5BMpfirot9MR2R ADJA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Exme7x6wqvrPNSoGRcTsjuRd4zyih5JJYuZtITyXcmg=; fh=2UGkfAkzepSnL8IIfYCVKvWqyARAn3RGLkRx2laYqok=; b=eJs0Ak4MfLxr+dUAUbkG4r0KtQ1diU15DtHaRt4bzuvGXrdswtriBxphL7/X5BZFnc 1UsRr9QIJAALe1AAcf/MEo8OQBr/kTLfD4LRX9TyhbJdtZ9HOIKzjD6WMqEA6pukyZkN 0ZaH8F1yt8zzpVEbbDOvlcZScJrIflcSCFI1+lvOtpbMGEzz2yH5P9Hes7i3/N9xtZr7 L29j6Bu+j1BM7gQ2YOcmKDh7qBWkuwCf5gSeUKl/gjO3RofR+tghCJRnTtlBz1auuoGv W3EJi2OWCjue0Qgb2/sil35ML3pIqJibWVDOv+lKFblRUZpmZMIieZOG9LjdR7rgtk7R uygQ==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1777509689; x=1778114489; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Exme7x6wqvrPNSoGRcTsjuRd4zyih5JJYuZtITyXcmg=; b=H6eQ2q2lS2wdFUPvVRX2SVz/S8zFIOGqhrFUHtblo0cQclZ2CNtCeP6TdJjtWj5GMi 19a3uu5HBYv6sv2/LuHI075uf+1VnhDhaExvdIRyRg6Ow6NJPcWvZOHpeMD66rg9eZ0n Y3xwSpMoBSQAMyWFwQ50h2Snr4II9bWMFENLib7JmQOD1wZ/toGKmrjVQGhHUTo/UNm3 kQBqH5IMab+cKaRah/XF/C+sfG226hOAXulCN7PMb35XDQgI+UGlsXzNhrqDLd8N3POe +Q3gQ/967MKre0+Kpn6oskFjy00sZvP1zvxgGsoydJYUHH7fffSDVcO4pssgPtF8cO9I nFrA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777509689; x=1778114489; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Exme7x6wqvrPNSoGRcTsjuRd4zyih5JJYuZtITyXcmg=; b=XBIumAUiaMPcBJUOXHtRtPwpnBl32Zu+y+ifpFKgO3NkRXCYUa8m3vr8p2efATH0RN 6+2C/c7tHCwNPz9PJ+T6jTSwH6CEF+F4d4Qy/YvSwQH7pwtinDMFyrCRe7VPQRf9G4+T 0uPcJpzI9P4PGui7GAVs3ZzlRQ0SKkPd1IhdfIn6sY1uhKoYMeK1xjPZFnM76cJsEWVW mdW21/Rt9m8h8LgzIiY9XDIljSVe0TKYNJ93efuyzBMJlwuBu+qlzK/tNglriB+bqj/c BJQQOAe9l3vQlQVJIYAIQRfaljhcLcWSeFxTbG6ehOmbnAMNPjrKSilfBeU9+6Snh8F0 OY6w== X-Gm-Message-State: AOJu0Ywe75FfL9PfxMWOFodsd3zvz3madYlBSKbkYkkhLR30p2kZtkEo kimonXqOQi9iTK+bS1pElKji4joqFSsmY3tfnIvON4YeLthV12rXLhsnCBxMWJQVmTGKGjZb8Mr M7fkVx154Pt4GuYMJO3d0pNLbelyUAWw= X-Gm-Gg: AeBDiesS4CJ4zKkmB9VFQRk139+dv0MOTZj3Y7HePgaThvmrZE/Ta+GP8LCWpw/zd1P xA8MQfZ96zB9C0CkVdzjtoSVbbEc2Gd6wmEQYKAbvuD1cPyNaReo9fYqlNJtJEKEFjEaCW2TBx+ ipCJTJnnvTlqLD2FanPm+HbFrrpMuYbFiTVZlH9LPeM5shA2UhWCv6MP3b6BnJ+ZjxzEpK+1vta S8NUuJQH1LI9oq1G5vL1FpdONMIBs63SYIf5f5Dr/ZzjC25P2hKGH74ZK6hnFi0bQIxrhvGS4jJ pTmiDuF70vbeA1C+yDPjE2VL63pW1LV30VI1fbHmSqwf6YQPKKa0eHBiHtMOdwZou9CSly/lgqY = X-Received: by 2002:a05:7300:1491:b0:2cb:de38:c76f with SMTP id 5a478bee46e88-2ed3dcb6e76mr119258eec.6.1777509689024; Wed, 29 Apr 2026 17:41:29 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Thomas Munro Date: Thu, 30 Apr 2026 12:40:52 +1200 X-Gm-Features: AVHnY4Js7EkjmrkmeC0c8f136N3psOdiODsFwRwd5O8mlPdhBbu-CjzYsQ6SQXI Message-ID: Subject: Re: Experimenting with wider Unicode storage To: assam258@gmail.com Cc: PostgreSQL Hackers , Tatsuo Ishii Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Apr 21, 2026 at 1:16=E2=80=AFPM Henson Choi wr= ote: > Thank you again for sharing this exploration, and for including > Korean in your experiment table. Rather than comment on the > patch itself, let me offer a ground-level report on where Korean > encoding reality sits in April 2026, because the picture has > shifted enough that I think it is worth entering into the record > before this thread accumulates momentum on motivations that may > no longer fully hold on this side of the region. Hi Henson, Thank you for this thoughtful and broad feedback, which provided a lot of useful context. I appreciated all of it, and have responses to a couple of the most actionable paragraphs: > One broader question, then, that I wanted to put to you: there > are three distinct axes on which utf16 could be pursued =E2=80=94 as a > server character set, as a data type, or as a compression angle. > The character-set direction runs straight into the "continuation > byte must not look like ASCII" rule, as you already noted, and > is therefore effectively closed on PostgreSQL. The data-type > direction is the current patch, which carries substantial > catalogue and operator surface, while the storage wins mostly > accrue on wider values =E2=80=94 where columnar + zstd is already doing > the work. What still seems genuinely unaddressed in practice is > the short-value regime: word-sized strings such as names, > titles, cities, and tags, which fall below the TOAST compression > threshold and therefore never see a compressor at all. Would > framing this as "a compression method effective on word-sized > values" be a more productive angle than either of the other two? > The storage outcome could be similar with much less surface area > to maintain. Yeah, that is an interesting angle that I hadn't considered, at least not with that framing. There are even a couple of Unicode standards that might apply here, and that I believe some other systems are using: https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode https://www.unicode.org/notes/tn6/ BOCU-1 maintains binary codepoint order and reports typical English/French as no size change compared to UTF-8, Greek/Russian/Arabic/Hebrew as -40%, Hindi as -60% (this makes sense: it's almost a generalised ISCII, so you get down to one byte per character in any given Indian language), Japanese as -40% and Chinese/Korean as -25% (Japanese presumably wins with kana sequences). One of the ideas already mentioned in comments in the experimental patch was that the iterator abstraction could allow for incremental decompression, and I suppose there might be a way to expand BOCU-1 or similar to UTF-8 incrementally in that layer. I haven't looked into that seriously though; so far I had only been thinking of that as a way of generalising some open coded special cases that appear in a few places to avoid detoasting. ICU might also be able to consume it incrementally, IDK. zstd etc can clearly compress much more than that, as you say, but then you have to deal with dictionary problems and it's hard to do that for small values in a row-oriented system, as you say. BOCU-1 is dictionary-free, so you read it in direct byte order with only a tiny state in a register or two, which seems to be potentially along the lines you're suggesting. Food for thought. > A fair counter on memory, before I go on: disk pressure has > clearly migrated elsewhere, but shared_buffers and work_mem > remain finite, and compression primarily addresses the disk > side. A data-type approach that goes far enough to shrink the > in-memory representation =E2=80=94 modifying every string function > along the way =E2=80=94 tends to become a degraded form of a new > character set: doing most of the character-set work without the > character-set slot in PostgreSQL's encoding machinery, which as > above is closed. None of the three axes therefore cleanly > solves the in-memory case; for truly memory-bound CJK workloads > the honest answer is probably just more RAM. Yeah. It's an annoying set of constraints that led me to consider this, while surveying text handling choices made in lots of database systems. Of course it wouldn't be my preference to introduce a new type, but I couldn't see how how else to fit it in, and since I was already investigating "modifying every string function along the way" for other reasons, I wanted to explore what it would take to do that generically enough to handle something as different as this while remaining maintainable... BTW here is the link that I forgot to add to the bottom of my earlier email as reference [3], which is a blog from when SQL Server introduced the *opposite* thing: UTF-8 support (like Windows itself, in 2019). Previously they had only legacy single/multi-byte encodings in VARCHAR and UTF-16 in NVARCHAR, so there they were discussing this tradeoff in reverse, ie space savings for some languages, but reported 25% increase in disk I/O for CJK databases moved to UTF-8. (I don't immediately know why SCSU didn't fix that.) https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-suppor= t-for-sql-server/734928 > Should you nonetheless decide to press on with utf16 as a data > type, I am willing to take the patch through a proper review; I > have already applied it on top of master and confirmed that the > regression tests pass, so the mechanical footing is in place. Thanks. I'm not planning to do more with the "separate UTF-16 type" concept at this stage, based on your feedback so far. I am still working on a couple of text/encoding refactoring prototypes with other goals, and will try to think about that "special Unicode compression" angle while doing so.