MIME-Version: 1.0
References: 
 <CA+hUKG+VEg7OsbRNbRcakp2k+078PCDhZ6HUJjvGvJ839ivxDQ@mail.gmail.com>
 <CAAAe_zANMo3o280YU96Nt=JK=mq=PfygvgT1GnG=7Wuh+Es1GQ@mail.gmail.com>
In-Reply-To: 
 <CAAAe_zANMo3o280YU96Nt=JK=mq=PfygvgT1GnG=7Wuh+Es1GQ@mail.gmail.com>
Reply-To: assam258@gmail.com
From: Henson Choi <assam258@gmail.com>
Date: Tue, 21 Apr 2026 10:16:26 +0900
Message-ID: 
 <CAAAe_zCktovow1irTy0eD1Lmu2UMQi+DN9uGTFoWrcyXea7SMg@mail.gmail.com>
Subject: Re: Experimenting with wider Unicode storage
To: Thomas Munro <thomas.munro@gmail.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Tatsuo Ishii <ishii@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000c34620064fee2a9f"
Archived-At: 
 <https://www.postgresql.org/message-id/CAAAe_zCktovow1irTy0eD1Lmu2UMQi%2BDN9uGTFoWrcyXea7SMg%40mail.gmail.com>
Precedence: bulk

--000000000000c34620064fee2a9f
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Thomas,


Thank you again for sharing this exploration, and for including
Korean in your experiment table.  Rather than comment on the
patch itself, let me offer a ground-level report on where Korean
encoding reality sits in April 2026, because the picture has
shifted enough that I think it is worth entering into the record
before this thread accumulates momentum on motivations that may
no longer fully hold on this side of the region.


UTF-8 has already won in Korea, largely by inertia rather than
active choice.  Public web statistics put .kr sites at roughly
96% UTF-8 with a small EUC-KR residual of about 4% [1].  That is
noticeably higher than the ~1% Shift-JIS residual on .jp [2],
but steadily shrinking.  The mechanism is mundane: modern Linux
distributions default to UTF-8 locales, PostgreSQL's initdb
inherits that, and every new cluster is therefore UTF-8 from
birth.  The remaining legacy installations are not "haven't
migrated yet" but "have decided not to migrate," which is
a different and much slower-moving population.


A clarification that often trips people up: in Korean practice,
"EUC-KR" is the label written down and CP949 is what actually
moves on the wire.  Microsoft's UHC has been the Windows default
for decades, and the MIME label has simply stuck.  The historical
stack goes KS X 1001 (완성형, 2,350 syllables) → EUC-KR → CP949
(11,172 syllables) → UTF-8.  PostgreSQL's strict EUC_KR decoder
rejects the bytes CP949 adds, which occasionally causes real
incidents when Windows-exchanged files are loaded.  For any
design choice about "Korean legacy support", this matters: what
needs supporting is usually CP949, not EUC-KR proper.
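
A minimal sketch of that failure mode, using Python's built-in
"cp949" and "euc_kr" codecs as stand-ins for the two decoders.
This is an analogy only: PostgreSQL's EUC_KR verifier is separate
code, and the helper name loads_as is hypothetical.  U+B620 is one
of the syllables CP949 adds beyond the 2,350 in KS X 1001.

```python
def loads_as(raw: bytes, codec: str) -> bool:
    """Return True if raw decodes cleanly under the given Python codec."""
    try:
        raw.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# U+B620 is a modern Hangul syllable outside the KS X 1001 repertoire.
syllable = "\ub620"

# The bytes a CP949 (Windows) producer would write for it.
wire = syllable.encode("cp949")

print("bytes on the wire:", wire.hex())
print("accepted as CP949: ", loads_as(wire, "cp949"))
print("accepted as EUC-KR:", loads_as(wire, "euc_kr"))
```

The same two bytes should be accepted under the CP949 label and
rejected under the strict EUC-KR one, which is the shape of the
incident described above.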


Server encoding and client encoding are also routinely split.  A
common Korean deployment pattern is a PostgreSQL cluster with
UTF-8 as server encoding, while legacy Windows / Delphi / C++ /
older Java clients connect with client_encoding set to EUC_KR or
UHC (PostgreSQL's name for CP949) and let PostgreSQL transcode at
the wire boundary.  Many
systems that look like "EUC-KR systems" from the outside are
actually UTF-8 storage with an EUC-KR wire.  The storage-layer
share of legacy is therefore probably smaller still than the
roughly 4% web figure would suggest.


On the Korean row of your table landing at -16% under UTF-16:
that is structural, not noise.  Modern Korean writing mandates
word-space separation (unlike Chinese and Japanese), has
effectively abandoned hanja since the 1990s, and freely
interleaves ASCII acronyms (IT, AI, CEO).  As a result Korean
carries the highest ASCII share among CJK languages, and UTF-16
pays for each ASCII position (one byte → two) in exactly the
range where the Hangul savings are meant to come from.  Columns
without spaces (names, titles, addresses) could approach -33%,
but general prose cannot.  Those same short columns are, however,
exactly where the compression angle I return to further below
captures the equivalent saving without a new data type.
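
The arithmetic behind those two figures can be checked with a
few lines of Python.  This is only a sketch: the sample strings
are my own illustrations, not rows from your table, and
utf16_delta is a hypothetical helper name.

```python
def utf16_delta(text: str) -> float:
    """Relative size change going from UTF-8 to UTF-16 (negative = smaller)."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # "-le" avoids counting a 2-byte BOM
    return (utf16 - utf8) / utf8


# Illustrative samples only; escapes keep the listing ASCII-safe.
samples = {
    "name, Hangul only": "\uae40\ucca0\uc218",   # three syllables, 3 bytes each in UTF-8
    "acronym, ASCII only": "CEO",                # 1 byte each in UTF-8, 2 in UTF-16
    "mixed, with a space": "\uc11c\uc6b8 AI",    # two syllables, a space, two letters
}

for label, text in samples.items():
    print(f"{label}: {utf16_delta(text):+.0%}")
```

This prints -33%, +100% and +11% respectively: every Hangul
syllable saves one byte, every ASCII position costs one, so the
net result depends entirely on the mix.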


Storage pressure, to the extent modern operators feel it at all,
has largely migrated to other layers.  Memory and disk have both
followed exponential price/volume curves, and the CPU cost of
text comparison has disappeared inside other costs (network,
storage I/O, planning, JIT) to the point of invisibility in
profiler output.  For OLTP, the 2-vs-3-byte difference on Korean
columns does not feel meaningful on modern hardware.  For bulk
scans where byte counts still do matter, the industry answer has
already been columnar + zstd, which routinely reaches 90%+
compression on natural-language text and flattens the
CJK-vs-Latin ratio to irrelevance.  Embedded and edge are not
PostgreSQL's primary target, and archival sits in zstd territory
too.  The domains that historically motivated "we must narrow
CJK storage" have either moved outside the PostgreSQL shape or
been absorbed by general-purpose compression.


Meanwhile the cultural arrow points toward more Unicode, not
less.  KakaoTalk (which saturates domestic messaging), Naver
comments, Instagram captions, and YouTube normalise emoji in
everyday prose, while AI-generated Korean text contributes
middle dots, em dashes, and curly quotes at a scale that was
not present a few years ago.  The share of non-EUC-KR content
in everyday Korean prose is, informally, rising steadily.  Most
emoji take four or more UTF-8 bytes and cannot be represented in
EUC-KR or CP949 at all.  A partial-coverage alternative looks
increasingly awkward against that trend.


Korean upstream feedback on encoding has also been notably quiet
despite a very active de-Oracle migration wave in the late 2010s.
I suspect this silence is not apathy but absence of a felt
problem: most of the community has simply moved on.


I should be careful here.  The "Korean side needs narrower CJK
storage" argument was genuinely strong around 2010, and I
remember when it motivated serious engineering time.  It is much
weaker in 2026: UTF-8 has won by default, legacy survivors are
confined to wire protocols and specific applications, OLTP does
not feel the byte cost, and bulk scan is already handled
elsewhere.  I raise this not to dismiss the technical work; the
patch shows real craft and the exploration is interesting on its
own terms.  But if the cover-letter motivation rests partly on
"this will help East Asian users, including Korea," I wanted you
to have a ground-level report: for Korean users specifically, the
pressure may no longer be strong enough to justify the complexity
described.  The calculus may well differ in Japanese or Chinese
markets; that is not for me to say.


One broader question, then, that I wanted to put to you: there
are three distinct axes on which utf16 could be pursued: as a
server character set, as a data type, or as a compression angle.
The character-set direction runs straight into the "continuation
byte must not look like ASCII" rule, as you already noted, and
is therefore effectively closed on PostgreSQL.  The data-type
direction is the current patch, which carries substantial
catalogue and operator surface, while the storage wins mostly
accrue on wider values, where columnar + zstd is already doing
the work.  What still seems genuinely unaddressed in practice is
the short-value regime: word-sized strings such as names,
titles, cities, and tags, which fall below the TOAST compression
threshold and therefore never see a compressor at all.  Would
framing this as "a compression method effective on word-sized
values" be a more productive angle than either of the other two?
The storage outcome could be similar with much less surface area
to maintain.


A fair counter on memory, before I go on: disk pressure has
clearly migrated elsewhere, but shared_buffers and work_mem
remain finite, and compression primarily addresses the disk
side.  A data-type approach that goes far enough to shrink the
in-memory representation (modifying every string function
along the way) tends to become a degraded form of a new
character set: doing most of the character-set work without the
character-set slot in PostgreSQL's encoding machinery, which as
above is closed.  None of the three axes therefore cleanly
solves the in-memory case; for truly memory-bound CJK workloads
the honest answer is probably just more RAM.


One concrete instantiation of that compression angle, if Korean
capacity specifically is the example that matters: take CP949
(which is what actually circulates under the EUC-KR label) as a
compression base and, for any character CP949 cannot represent,
spell it inline as a readable textual escape such as \u2603 or
U+2603 rather than a binary marker byte.  Native Korean text
then stays at two bytes per Hangul, emoji and modern Unicode
remain fully representable (at a modest cost per occurrence),
the in-memory representation stays plain UTF-8, and the on-disk
byte stream stays entirely within ASCII + CP949: no new marker
byte, no collision with existing code paths that scan for raw
ASCII bytes.  If the source text itself contains sequences that
look like the escape syntax (for instance documentation quoting
\u-style literals), a simple doubling rule disambiguates them;
such cases are vanishingly rare in Korean business data.  This
targets exactly the short-value regime above, with far less
surface than a new data type.
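
To make the idea concrete, here is a minimal sketch in Python
rather than anything resembling a PostgreSQL patch.  The names
pack and unpack are hypothetical, and so is the exact escape
syntax: I assume \uXXXX inside the BMP, \UXXXXXXXX beyond it
(emoji need more than four hex digits), and a doubled backslash
for a literal one.

```python
import re


def pack(text: str) -> bytes:
    """Encode text as CP949, spelling unrepresentable characters as escapes."""
    out = bytearray()
    for ch in text:
        if ch == "\\":
            out += b"\\\\"  # doubling rule: a literal backslash is written twice
            continue
        try:
            out += ch.encode("cp949")  # ASCII stays 1 byte, Hangul becomes 2
        except UnicodeEncodeError:
            cp = ord(ch)
            # Assumed escape syntax: \uXXXX in the BMP, \UXXXXXXXX beyond it.
            esc = f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"
            out += esc.encode("ascii")
    return bytes(out)


# A doubled backslash, or a backslash followed by a well-formed escape body.
_ESCAPE = re.compile(r"\\(\\|u[0-9a-f]{4}|U[0-9a-f]{8})")


def unpack(raw: bytes) -> str:
    """Reverse pack(): decode CP949, then expand escapes and doubled backslashes."""
    def expand(match):
        body = match.group(1)
        return "\\" if body == "\\" else chr(int(body[1:], 16))

    return _ESCAPE.sub(expand, raw.decode("cp949"))


# Two Hangul syllables, an emoji outside CP949, and a literal backslash.
sample = "\uc11c\uc6b8 \U0001F600 C:\\tmp"
packed = pack(sample)
print(len(sample.encode("utf-8")), "bytes as UTF-8,", len(packed), "bytes packed")
print("round trip ok:", unpack(packed) == sample)
```

In this sketch the two Hangul syllables take four bytes instead
of six, while the emoji grows from four bytes to ten, which is
the "modest cost per occurrence" above.  One caveat: CP949 trail
bytes can fall in the ASCII letter range, so the packed form
should be treated as an opaque on-disk stream and expanded back
to UTF-8 before any string function sees it.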


For tighter byte density, one could go further by devising a
dedicated binary-level encoding, or by wiring zstd's external
dictionary feature into the column-compression path with a
pre-trained per-language dictionary, but either of those paths
carries its own implementation and operational costs.


Should you nonetheless decide to press on with utf16 as a data
type, I am willing to take the patch through a proper review; I
have already applied it on top of master and confirmed that the
regression tests pass, so the mechanical footing is in place.


[1] https://w3techs.com/technologies/segmentation/tld-kr-/character_encoding
[2] https://w3techs.com/technologies/segmentation/tld-jp-/character_encoding


Best regards,
Henson



--000000000000c34620064fee2a9f--