MIME-Version: 1.0
References: 
 <CA+hUKG+VEg7OsbRNbRcakp2k+078PCDhZ6HUJjvGvJ839ivxDQ@mail.gmail.com>
 <CAAAe_zANMo3o280YU96Nt=JK=mq=PfygvgT1GnG=7Wuh+Es1GQ@mail.gmail.com>
 <CAAAe_zCktovow1irTy0eD1Lmu2UMQi+DN9uGTFoWrcyXea7SMg@mail.gmail.com>
In-Reply-To: 
 <CAAAe_zCktovow1irTy0eD1Lmu2UMQi+DN9uGTFoWrcyXea7SMg@mail.gmail.com>
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 30 Apr 2026 12:40:52 +1200
Message-ID: 
 <CA+hUKGJvFV3Bd=dxN1C2eOvhxAki363j1jmoxrkw2MkyK_3Kig@mail.gmail.com>
Subject: Re: Experimenting with wider Unicode storage
To: assam258@gmail.com
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Tatsuo Ishii <ishii@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BhUKGJvFV3Bd%3DdxN1C2eOvhxAki363j1jmoxrkw2MkyK_3Kig%40mail.gmail.com>
Precedence: bulk

On Tue, Apr 21, 2026 at 1:16 PM Henson Choi <assam258@gmail.com> wrote:
> Thank you again for sharing this exploration, and for including
> Korean in your experiment table.  Rather than comment on the
> patch itself, let me offer a ground-level report on where Korean
> encoding reality sits in April 2026, because the picture has
> shifted enough that I think it is worth entering into the record
> before this thread accumulates momentum on motivations that may
> no longer fully hold on this side of the region.

Hi Henson,

Thank you for this thoughtful and broad feedback, which provided a lot
of useful context.  I appreciated all of it, and have responses to a
couple of the most actionable paragraphs:

> One broader question, then, that I wanted to put to you: there
> are three distinct axes on which utf16 could be pursued — as a
> server character set, as a data type, or as a compression angle.
> The character-set direction runs straight into the "continuation
> byte must not look like ASCII" rule, as you already noted, and
> is therefore effectively closed on PostgreSQL.  The data-type
> direction is the current patch, which carries substantial
> catalogue and operator surface, while the storage wins mostly
> accrue on wider values — where columnar + zstd is already doing
> the work.  What still seems genuinely unaddressed in practice is
> the short-value regime: word-sized strings such as names,
> titles, cities, and tags, which fall below the TOAST compression
> threshold and therefore never see a compressor at all.  Would
> framing this as "a compression method effective on word-sized
> values" be a more productive angle than either of the other two?
> The storage outcome could be similar with much less surface area
> to maintain.

Yeah, that is an interesting angle that I hadn't considered, at least
not with that framing.  There are even a couple of Unicode standards
that might apply here, and that I believe some other systems are
using:

https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode
https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode
https://www.unicode.org/notes/tn6/

BOCU-1 maintains binary codepoint order and reports typical
English/French as no size change compared to UTF-8,
Greek/Russian/Arabic/Hebrew as -40%, Hindi as -60% (this makes sense:
it's almost a generalised ISCII, so you get down to one byte per
character in any given Indian language), Japanese as -40% and
Chinese/Korean as -25% (Japanese presumably wins with kana sequences).

One of the ideas already mentioned in comments in the experimental
patch was that the iterator abstraction could allow for incremental
decompression, and I suppose there might be a way to expand BOCU-1 or
similar to UTF-8 incrementally in that layer.  I haven't looked into
that seriously though; so far I had only been thinking of that as a
way of generalising some open coded special cases that appear in a few
places to avoid detoasting.  ICU might also be able to consume it
incrementally, IDK.

zstd etc. can clearly compress much more than that, as you say, but
then you have to deal with dictionary problems, and that is hard to do
for small values in a row-oriented system.  BOCU-1 is
dictionary-free, so you read it in direct byte order with only a tiny
state in a register or two, which seems to be potentially along the
lines you're suggesting.  Food for thought.
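
To make the "tiny state, no dictionary" point concrete, here is a
minimal sketch in Python.  It is emphatically NOT BOCU-1: it is a toy
difference coder invented for illustration, it does not preserve
binary order, and the names toy_encode, toy_decode_iter and
block_middle are hypothetical.  Each code point is stored as a
one-byte delta from the middle of the previous character's
128-code-point block, with a crude 4-byte escape otherwise, and the
decoder is a generator so that a caller can stop early.

```python
from itertools import islice

START = 0x40  # initial state: middle of the ASCII block (toy assumption)


def block_middle(cp):
    """Middle of the 128-code-point block containing cp."""
    return (cp & ~0x7F) + 0x40


def toy_encode(text):
    """Toy difference coder; illustrative only, not BOCU-1."""
    out = bytearray()
    prev = START
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            out.append(cp)              # space/controls: literal, state kept
            continue
        delta = cp - prev
        if -64 <= delta <= 63:
            out.append(0x80 + delta)    # one byte in 0x40..0xBF
        else:
            out.append(0xFF)            # crude escape: 3-byte absolute value
            out += cp.to_bytes(3, "big")
        prev = block_middle(cp)
    return bytes(out)


def toy_decode_iter(data):
    """Yield characters one at a time; the only state is `prev`."""
    it = iter(data)
    prev = START
    for b in it:
        if b <= 0x20:
            yield chr(b)
            continue
        if b == 0xFF:
            cp = (next(it) << 16) | (next(it) << 8) | next(it)
        elif 0x40 <= b <= 0xBF:
            cp = prev + (b - 0x80)
        else:
            raise ValueError(f"unused byte 0x{b:02X}")
        prev = block_middle(cp)
        yield chr(cp)


if __name__ == "__main__":
    for s in ["hello world", "Привет мир", "नमस्ते"]:
        enc = toy_encode(s)
        assert "".join(toy_decode_iter(enc)) == s
        print(len(s.encode("utf-8")), len(enc), s)
    # Incremental use: pull only the first three characters.
    print("".join(islice(toy_decode_iter(toy_encode("Привет мир")), 3)))
```

On the Russian sample this toy needs 13 bytes against 19 for UTF-8:
one escape to enter the Cyrillic block, then one byte per character.
Hangul and Han text would mostly hit the escape here, which is one
reason the real BOCU-1 uses variable-length multi-byte differences
rather than anything this crude.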

> A fair counter on memory, before I go on: disk pressure has
> clearly migrated elsewhere, but shared_buffers and work_mem
> remain finite, and compression primarily addresses the disk
> side.  A data-type approach that goes far enough to shrink the
> in-memory representation — modifying every string function
> along the way — tends to become a degraded form of a new
> character set: doing most of the character-set work without the
> character-set slot in PostgreSQL's encoding machinery, which as
> above is closed.  None of the three axes therefore cleanly
> solves the in-memory case; for truly memory-bound CJK workloads
> the honest answer is probably just more RAM.

Yeah.  It's an annoying set of constraints that led me to consider
this, while surveying text handling choices made in lots of database
systems.  Of course it wouldn't be my preference to introduce a new
type, but I couldn't see how else to fit it in, and since I was
already investigating "modifying every string function along the way"
for other reasons, I wanted to explore what it would take to do that
generically enough to handle something as different as this while
remaining maintainable...

BTW here is the link that I forgot to add to the bottom of my earlier
email as reference [3], which is a blog post from when SQL Server
introduced the *opposite* thing: UTF-8 support (like Windows itself,
in 2019).  Previously they had only legacy single/multi-byte encodings
in VARCHAR and UTF-16 in NVARCHAR, so there they were discussing this
tradeoff in reverse, i.e. space savings for some languages, but a
reported 25% increase in disk I/O for CJK databases moved to UTF-8.
(I don't immediately know why SCSU didn't fix that.)

https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-support-for-sql-server/734928
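
For a feel of the per-character arithmetic behind that tradeoff, here
is a small Python sketch that compares encoded lengths.  The sample
strings are my own illustrative picks, encoded_sizes is a hypothetical
helper, and this only counts bytes of the raw encodings; it makes no
attempt to reproduce the I/O figure from the blog.

```python
def encoded_sizes(text):
    """Byte lengths of text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    for s in ["Seoul", "서울특별시", "東京都", "Αθήνα"]:
        sizes = encoded_sizes(s)
        print(f"{s!r}: utf-8={sizes['utf-8']} utf-16={sizes['utf-16']}")
```

ASCII costs 1 byte in UTF-8 and 2 in UTF-16, while Hangul syllables
and most Han characters in the BMP cost 3 bytes in UTF-8 and 2 in
UTF-16, so which encoding is smaller depends on the language mix.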

> Should you nonetheless decide to press on with utf16 as a data
> type, I am willing to take the patch through a proper review; I
> have already applied it on top of master and confirmed that the
> regression tests pass, so the mechanical footing is in place.

Thanks.  I'm not planning to do more with the "separate UTF-16 type"
concept at this stage, based on your feedback so far.  I am still
working on a couple of text/encoding refactoring prototypes with other
goals, and will try to think about that "special Unicode compression"
angle while doing so.