From: Henson Choi
Reply-To: assam258@gmail.com
Date: Tue, 21 Apr 2026 10:16:26 +0900
Subject: Re: Experimenting with wider Unicode storage
To: Thomas Munro
Cc: PostgreSQL Hackers, Tatsuo Ishii

Hi Thomas,


Thank you again for sharing this exploration, and for including
Korean in your experiment table.  Rather than comment on the
patch itself, let me offer a ground-level report on where Korean
encoding reality sits in April 2026, because the picture has
shifted enough that I think it is worth entering into the record
before this thread accumulates momentum on motivations that may
no longer fully hold on this side of the region.


UTF-8 has already won in Korea, largely by inertia rather than
active choice.  Public web statistics put .kr sites at roughly
96% UTF-8 with a small EUC-KR residual of about 4% [1],
noticeably higher than the ~1% Shift-JIS residual on .jp [2]
but steadily shrinking.  The mechanism is mundane: modern Linux
distributions default to UTF-8 locales, PostgreSQL's initdb
inherits that, and every new cluster is therefore UTF-8 from
birth.  The remaining legacy installations are not "haven't
migrated yet"; they are "have decided not to migrate," which is
a different and much slower population.


A clarification that often trips people up: in Korean practice,
"EUC-KR" is the label written down and CP949 is what actually
moves on the wire.  Microsoft's UHC has been the Windows default
for decades, and the MIME label has simply stuck.  The historical
stack goes KS X 1001 (완성형 "Wansung", 2,350 syllables) → EUC-KR
→ CP949 (11,172 syllables) → UTF-8.  PostgreSQL's strict EUC_KR
decoder rejects the bytes CP949 adds, which occasionally causes
real incidents when Windows-exchanged files are loaded.  For any
design choice about "Korean legacy support", this matters: what
needs supporting is usually CP949, not EUC-KR proper.
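
As a rough illustration of that gap, here is a minimal Python sketch.
It uses only the standard-library `cp949` and `euc_kr` codecs.  The
assumption to keep in mind is that Python's `euc_kr` decoder merely
stands in for a strict decoder; it is not PostgreSQL's EUC_KR code, so
the behaviour is an analogy rather than a reproduction.  The helper
names are mine.

```python
# Sketch: which Hangul syllables does CP949 add beyond KS X 1001?
# Assumption: Python's stdlib codecs stand in for the real decoders.

def in_ksx1001_range(raw: bytes) -> bool:
    """True if a 2-byte sequence has both bytes in 0xA1-0xFE (EUC-KR plane)."""
    return len(raw) == 2 and all(0xA1 <= b <= 0xFE for b in raw)

def strict_euc_kr_accepts(raw: bytes) -> bool:
    """Try to decode the bytes with the stricter euc_kr codec."""
    try:
        raw.decode("euc_kr")
        return True
    except UnicodeDecodeError:
        return False

# All precomposed Hangul syllables, U+AC00..U+D7A3.
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
encoded = {s: s.encode("cp949") for s in syllables}

# Split into the KS X 1001 subset and the CP949-only extension.
base = [s for s, raw in encoded.items() if in_ksx1001_range(raw)]
extension = [s for s, raw in encoded.items() if not in_ksx1001_range(raw)]

sample = extension[0]  # first syllable that only CP949 can carry in 2 bytes
print(len(syllables), len(base), len(extension))
print(sample, encoded[sample].hex(), strict_euc_kr_accepts(encoded[sample]))
```

A file written on Windows that contains even one syllable from the
extension list would trip a strict decoder in the same way.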


Server encoding and client encoding are also routinely split.  A
common Korean deployment pattern is a PostgreSQL cluster with
UTF-8 as server encoding, while legacy Windows / Delphi / C++ /
older Java clients connect with client_encoding set to EUC-KR or
CP949 and let PostgreSQL transcode at the wire boundary.  Many
systems that look like "EUC-KR systems" from the outside are
actually UTF-8 storage with an EUC-KR wire.  The storage-layer
share of legacy is therefore probably smaller still than the
3.8% web figure would suggest.


On the Korean row of your table landing at -16% under UTF-16:
that is structural, not noise.  Modern Korean writing mandates
word-space separation (unlike Chinese and Japanese), has
effectively abandoned hanja since the 1990s, and freely
interleaves ASCII acronyms (IT, AI, CEO).  As a result Korean
carries the highest ASCII share among CJK languages, and UTF-16
pays for each ASCII position (one byte → two) in exactly the
range where the Hangul savings (three bytes → two) are meant to
come from.  Columns without spaces (names, titles, addresses)
could approach -33%, but general prose cannot.  Those same short
columns are, however, exactly where the compression angle I
return to further below captures the equivalent saving without a
new data type.
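
The arithmetic behind those two figures can be sketched in a few lines
of Python.  This is an idealised model, not a measurement: it assumes
text made only of ASCII characters (1 byte in UTF-8, 2 in UTF-16) and
Hangul syllables (3 bytes in UTF-8, 2 in UTF-16).  The sample strings
and function names are illustrative.  In this model a pure-Hangul
value shrinks by a third, the break-even point is an ASCII share of
50% of characters, and a share of about 31% gives roughly -16%.

```python
# Sketch: size change when moving Korean text from UTF-8 to UTF-16.

def utf16_size_change(text: str) -> float:
    """Relative size change from UTF-8 to UTF-16 (negative means smaller)."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le variant: no byte-order mark
    return (utf16 - utf8) / utf8

def model_size_change(ascii_share: float) -> float:
    """Same ratio for idealised text: ASCII (1 -> 2) plus Hangul (3 -> 2)."""
    utf8_per_char = ascii_share * 1 + (1 - ascii_share) * 3
    return (2 - utf8_per_char) / utf8_per_char

name = "김철수"                         # illustrative name: no spaces, no ASCII
prose = "AI 기반 IT 서비스를 출시했다"  # illustrative prose: spaces and acronyms

for label, text in (("name", name), ("prose", prose)):
    print(f"{label}: {utf16_size_change(text):+.0%}")
for share in (0.0, 0.31, 0.5):
    print(f"model, ASCII share {share:.0%}: {model_size_change(share):+.0%}")
```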


Storage pressure, to the extent modern operators feel it at all,
has largely migrated to other layers.  Memory and disk have both
followed exponential price/volume curves, and the CPU cost of
text comparison has disappeared inside other costs (network,
storage I/O, planning, JIT) to the point of invisibility in
profiler output.  For OLTP, the 2-vs-3-byte difference on Korean
columns does not feel meaningful on modern hardware.  For bulk
scans where byte counts still do matter, the industry answer has
already been columnar + zstd, which routinely reaches 90%+
compression on natural-language text and flattens the
CJK-vs-Latin ratio to irrelevance.  Embedded and edge are not
PostgreSQL's primary target, and archival sits in zstd territory
too.  The domains that historically motivated "we must narrow
CJK storage" have either moved outside the PostgreSQL shape or
been absorbed by general-purpose compression.


Meanwhile the cultural arrow points toward more Unicode, not
less.  KakaoTalk (which saturates domestic messaging), Naver
comments, Instagram captions, and YouTube normalise emoji in
everyday prose, while AI-generated Korean text contributes
middle dots, em dashes, and curly quotes at a scale that was
not present a few years ago.  The share of non-EUC-KR content
in everyday Korean prose is, informally, rising steadily.  A
typical emoji code point is four UTF-8 bytes and is
unrepresentable in the Korean legacy encodings at all.
A partial-coverage alternative looks increasingly awkward against
that trend.

Korean upstream feedback on encoding has also been notably quiet
despite a very active de-Oracle migration wave in the late 2010s.
I suspect this silence is not apathy but absence of a felt
problem: most of the community has simply moved on.


I should be careful here.  The "Korean side needs narrower CJK
storage" argument was genuinely strong around 2010, and I
remember when it motivated serious engineering time.  It is much
weaker in 2026: UTF-8 has won by default, legacy survivors are
confined to wire protocols and specific applications, OLTP does
not feel the byte cost, and bulk scan is already handled
elsewhere.  I raise this not to dismiss the technical work; the
patch shows real craft and the exploration is interesting on its
own terms.  But if the cover-letter motivation rests partly on
"this will help East Asian users, including Korea," I wanted you
to have a ground-level report: for Korean users specifically, the
pressure may no longer be strong enough to justify the complexity
described.  The calculus may well differ in Japanese or Chinese
markets; that is not for me to say.


One broader question, then, that I wanted to put to you: there
are three distinct axes on which utf16 could be pursued: as a
server character set, as a data type, or as a compression angle.
The character-set direction runs straight into the "continuation
byte must not look like ASCII" rule, as you already noted, and
is therefore effectively closed on PostgreSQL.  The data-type
direction is the current patch, which carries substantial
catalogue and operator surface, while the storage wins mostly
accrue on wider values, where columnar + zstd is already doing
the work.  What still seems genuinely unaddressed in practice is
the short-value regime: word-sized strings such as names,
titles, cities, and tags, which fall below the TOAST compression
threshold and therefore never see a compressor at all.  Would
framing this as "a compression method effective on word-sized
values" be a more productive angle than either of the other two?
The storage outcome could be similar with much less surface area
to maintain.


A fair counter on memory, before I go on: disk pressure has
clearly migrated elsewhere, but shared_buffers and work_mem
remain finite, and compression primarily addresses the disk
side.  A data-type approach that goes far enough to shrink the
in-memory representation, modifying every string function
along the way, tends to become a degraded form of a new
character set: doing most of the character-set work without the
character-set slot in PostgreSQL's encoding machinery, which as
above is closed.  None of the three axes therefore cleanly
solves the in-memory case; for truly memory-bound CJK workloads
the honest answer is probably just more RAM.


One concrete instantiation of that compression angle, if Korean
capacity specifically is the example that matters: take CP949
(which is what actually circulates under the EUC-KR label) as a
compression base and, for any character CP949 cannot represent,
spell it inline as a readable textual escape such as \u2603 or
U+2603 rather than a binary marker byte.  Native Korean text
then stays at two bytes per Hangul, emoji and modern Unicode
remain fully representable (at a modest cost per occurrence),
the in-memory representation stays plain UTF-8, and the on-disk
byte stream stays entirely within ASCII + CP949: no new marker
byte, no collision with existing code paths that scan for raw
ASCII bytes.  If the source text itself contains sequences that
look like the escape syntax (for instance documentation quoting
\u-style literals), a simple doubling rule disambiguates them;
such cases are vanishingly rare in Korean business data.  This
targets exactly the short-value regime above, with far less
surface than a new data type.
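
To make the shape of that idea concrete, here is a minimal Python
sketch, not a proposal for the actual on-disk format.  Everything
specific in it is my assumption: the function names `pack` and
`unpack`, the fixed-width escape forms (\uXXXX for BMP characters,
\UXXXXXXXX beyond the BMP), and the reading of the doubling rule as
"every literal backslash is written twice".  Python's `cp949` codec
stands in for whatever conversion routine a real implementation
would use.

```python
import re

# Escape forms (assumed): \\ for a literal backslash, \uXXXX, \UXXXXXXXX.
_ESCAPE = re.compile(r"\\(?:\\|u([0-9A-F]{4})|U([0-9A-F]{8}))")

def pack(text: str) -> bytes:
    """Encode text as ASCII + CP949 bytes, escaping what CP949 lacks."""
    out = bytearray()
    for ch in text:
        if ch == "\\":
            out += b"\\\\"  # doubling rule keeps literal backslashes unambiguous
            continue
        try:
            out += ch.encode("cp949")
        except UnicodeEncodeError:
            cp = ord(ch)
            esc = "\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp
            out += esc.encode("ascii")
    return bytes(out)

def unpack(data: bytes) -> str:
    """Reverse pack(): decode CP949 first, then resolve escapes left to right."""
    def resolve(match: re.Match) -> str:
        hexdigits = match.group(1) or match.group(2)
        return "\\" if hexdigits is None else chr(int(hexdigits, 16))
    return _ESCAPE.sub(resolve, data.decode("cp949"))

name = "김철수"      # illustrative short value
chat = "좋아요 😀"   # illustrative value with an emoji
print(len(name.encode("utf-8")), len(pack(name)))
print(pack(chat), unpack(pack(chat)) == chat)
```

Resolving escapes after the CP949 decode, in the character domain,
avoids having to reason about which byte values can occur as CP949
trail bytes.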


For tighter byte density, one could go further by devising a
dedicated binary-level encoding, or by wiring zstd's external
dictionary feature into the column-compression path with a
pre-trained per-language dictionary, but either of those paths
carries its own implementation and operational costs.
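
The dictionary variant can be illustrated with nothing but the Python
standard library, under two loud assumptions: zlib's preset dictionary
(`zdict`) is swapped in here for zstd's external dictionary, and the
dictionary below is hand-written rather than trained.  The mechanism
is analogous, but the byte counts say nothing about what zstd would
achieve.  The function names and sample strings are hypothetical.

```python
import zlib

# Hand-written stand-in for a pre-trained per-language dictionary.
DICTIONARY = "서울특별시 부산광역시 강남구 서초구 테헤란로 주식회사".encode("utf-8")

def compress_short(value: str, zdict: bytes = b"") -> bytes:
    """Compress one short value, optionally priming zlib with a dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(value.encode("utf-8")) + comp.flush()

def decompress_short(blob: bytes, zdict: bytes = b"") -> str:
    """Reverse compress_short(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return (decomp.decompress(blob) + decomp.flush()).decode("utf-8")

value = "서울특별시 강남구 테헤란로 152"  # illustrative address fragment
plain = compress_short(value)
primed = compress_short(value, DICTIONARY)
print(len(value.encode("utf-8")), len(plain), len(primed))
```

The operational cost mentioned above shows up even in this toy: the
reader needs exactly the dictionary the writer used, so dictionaries
would have to be versioned and kept for as long as any value
compressed with them survives.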


Should you nonetheless decide to press on with utf16 as a data
type, I am willing to take the patch through a proper review; I
have already applied it on top of master and confirmed that the
regression tests pass, so the mechanical footing is in place.


[1] https://w3techs.com/technologies/segmentation/tld-kr-/character_encoding
[2] https://w3techs.com/technologies/segmentation/tld-jp-/character_encoding

Best regards,
Henson
