Experimenting with wider Unicode storage

public inbox for [email protected]  
help / color / mirror / Atom feed

Experimenting with wider Unicode storage
4+ messages / 2 participants
[nested] [flat]

* Experimenting with wider Unicode storage
@ 2026-04-15 13:43  Thomas Munro <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Thomas Munro @ 2026-04-15 13:43 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>; +Cc: Tatsuo Ishii <[email protected]>; Henson Choi <[email protected]>

Hi,

We only allow one character encoding per database.  The SQL standard
and most comparable RDBMSs are more flexible, though the details vary.
In most, you can attach CHARACTER SET to column definitions, a whole
CREATE TABLE (non-standard), CREATE DOMAIN, CREATE SCHEMA and probably
more places.  In some, encoding is implied by COLLATE instead
(non-standard).  When different encodings meet, text is transcoded,
which works but is bad for performance.

Our single database encoding also has some restrictions: it must
encode ASCII as ASCII, and any byte that is part of a multibyte
sequence must not look like ASCII, since many code paths require that
and fixing that is hard.  That excludes a few encodings that people
want.

That's all OK by now, as modern information systems use Unicode
everywhere.  There is one practical problem that causes people to
complain[2] about PostgreSQL though: about half of the global
population uses a language that arbitrarily gained a byte per
character by switching to UTF-8 compared to various legacy encodings
or GB18030.  That's storage and RAM that you have to pay for up front
and forever (space, I/O[3]).

I wondered about inventing new PostgreSQL-backend-compatible encodings
that swizzle bits around to make CJK/I languages fit, but I kept
coming back to Unicode.

In many systems there is a special way to use UTF-16, which gets you
back to around two bytes per character for Chinese, Japanese, Korean
and probably Indian languages.  In MySQL/MariaDB, you can use various
encodings of UTF-16 as a CHARACTER SET with the normal text types, and
in Oracle, DB2, SQL Server/Sybase you can use a separate NVARCHAR type
for UTF-16, alongside the regular text types whose encodings are
controlled by CHARACTER SET or COLLATE.  N* types are shorthand for
the standardese "NATIONAL <TYPE>", a string in an unspecified special
encoding, for which they all chose some kind of UTF-16.

At first I thought that with first class extension types as one of our
superpowers, we might be able to do that with an extension, but the
problem there is that text is so tangled up with locales, and I
certainly didn't want to convert all over the place.  So I tried to
hack up a minimal demonstration of what a separate UTF-16 text type
might look like as a core data type.

You can save a lot of space if you have a separate "utf16" type
("national text"?), but only for East and South Asian languages:

+ language | octets  | delta |                     string
+----------+---------+-------+-------------------------------------------------
+ English  | 45→90   | +100% | In a hole in the ground there lived a hobbit.
+ Spanish  | 44→86   |  +95% | En un agujero en el suelo, vivía un hobbit.
+ Russian  | 59→66   |  +12% | В норе под землей жил-был хоббит.
+ Arabic   | 57→64   |  +12% | كان يعيش هوبيت في حفرة في الأرض.
+ Hebrew   | 43→48   |  +12% | בתוך חור באדמה חי הוביט.
+ Greek    | 74→82   |  +11% | Σε μια τρύπα στο έδαφος ζούσε ένα χόμπιτ.
+ Korean   | 55→46   |  -16% | 땅속 어느 구멍에 한 호빗이 살고 있었다.
+ Hindi    | 111→86  |  -23% | जमीन में बने एक गड्ढे में एक हॉबिट रहता था।
+ Tamil    | 146→108 |  -26% | அந்த நிலத்தில் ஒரு துளையில் ஒரு ஹாபிட்
வசித்து வந்தது.
+ Chinese  | 51→34   |  -33% | 在地下一个洞里，住着一个霍比特人。
+ Japanese | 66→44   |  -33% | 穴のなかに、ひとりのホビットが暮らしていた。

It's actually -33% in Korean too without spaces, or if you use
double-width spaces, a more common stylistic choice IIUC (this Korean
string happens to use ASCII spaces).  The same sort of thing explains
why Russian, Arabic, Hewbrew and Greek lose ~12% instead of breaking
even: they share space and some punctuation with ASCII.

The attached is highly exploratory concept code to try the idea out
and see if experts in CJK computing, defenders of the type system and
others think it might be worth exploring further.  Some of the
technical challenges and observations I spotted along the way:

* "text" (etc) and "utf16" need to be comparable incrementally without
conversion
* N^2 explosions in cross-type support function definitions must
surely be avoided
* the solution to that is surely generic programming, but PostgreSQL
is written in ye olde C
* therefore, this POC patch is big on macros as poor-man's C++
* to support fast paths like memcmp()-based ucs_basic comparison you
need big endian UTF-16
* to stuff UTF-16 into varlena you need to tolerate unaligned access
* by a happy coincidence, ICU supports big endian, unaligned UTF-16
* Windows libc locales might in theory allow UTF-16, but that'd have
to be native endian, which I didn't implement
* other systems actually allow both endiannesses as encodings or
subtypes, at the user's option
* they also allow you to control whether surrogates are allowed

In this patch you can see some string iterator concepts that I have
been hacking on for an entirely different purpose, namely trying to
figure out how to make our multibyte string support go faster (and
also be safer) by hoisting all the character-at-a-time loops out to
specialisations in string handling functions.  That's not shown here.
Neither is any kind of silent transcoding that plagues other systems
that do this kind of thing:

* the "utf16" type is only allowed to contain text that is a ASCII,
LATIN1 or Unicode, depending on the database encoding
* for UTF8, it's all of Unicode, for which mb_iterator can trivally
produce UTF-16 or UTF-32
* for LATIN1, that becomes trivial casting since LATIN1 is a strict
subset of Unicode
* for anything else, utf16 only allows ASCII characters, as a degraded
mode just to allow the tests to pass

In other words, so far utf16 is not allowed to represent anything that
"text" could not represent and convert trivially.  That was originally
a decision to make a quick proof-of-concept plausible, but maybe it's
even a good idea...

I have no doubt that there are lots of complicated problems that I
haven't met yet, when you add more types.  What I was trying to
explore here was whether you can exclude most of them by providing
enough conversion-free (incremental) cross-type support.  This is
something SQL Server DBAs talk about: mixtures of NVARCHAR and VARCHAR
columns and index befuddle the planner and introduce hidden execution
costs if you're not careful.  It struck me that with the above
restriction you could perhaps keep all strings cheaply and
incrementally comparable.  There are still some things you can't do:

* if you're comparing "text" with "utf16" then you lose the
length-based not-equal fast path
* that's a big deal if it means detoasting

Sharing for discussion only.  It has enough working to support
converting individual columns text->utf16 and use btree indexes.  Many
more text functions would need to be converted to generic form, and
many more support functions would be needed for full functionality.

(Individual Indian languages could in theory be compacted even further
to single-byte ISCII.  ISCII is in general infeasible as a server
encoding because it has stateful shifts between many scripts, but
single-script variants of ISCII as supported on some Unixen would in
theory be plausible.  Since India has so many languages and scripts
and information systems often need to support all of them, I am
reliably informed that UTF-8 reigns supreme there despite taking 3
bytes to represent the tiny 6 bit (?) character set of any individual
language like Devanagari (Hindi etc).  GNU/Linux doesn't even support
ISCII, so that idea is basically DOA.)

[1] https://www.postgresql.org/docs/current/infoschema-character-sets.html
[2] https://www.postgresql.org/message-id/flat/ME2PR01MB2532E72B514DC46ED0E10F798A0C0%40ME2PR01MB2532.au...


Attachments:

  [text/x-patch] 0001-pg_stack_alloc-Provide-API-for-stack-allocation.patch (15.6K, 2-0001-pg_stack_alloc-Provide-API-for-stack-allocation.patch)
  download | inline diff:
From 65ed1421fe1469bd7152db9009b3a22cf82a9ecb Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Fri, 27 Feb 2026 22:21:36 +1300
Subject: [PATCH 1/2] pg_stack_alloc: Provide API for stack allocation.

Several places have long used a large array on the stack for temporary
objects of dynamic size, and switched to palloc()/pfree() when it was
full.  This commit provides a centralized API for that pattern.

It is implemented the same way, but the interface looks like palloc.h's,
including the type-safety features added by commit 2016055a and later
developments.  A small set of functions like strdup() is also provided,
for common operations.

Later patches will adopt the new API in various places.

Reviewed-by:
Discussion: https://postgr.es/m/CA%2BhUKG%2BixUUYOGRcuZpkk5pmJZaUQv6VCPAjdTZXFP5vA8jxcA%40mail.gmail.com
---
 src/include/utils/pg_stack_alloc.h      | 315 ++++++++++++++++++++++++
 src/test/regress/expected/internals.out |  12 +
 src/test/regress/parallel_schedule      |   2 +-
 src/test/regress/regress.c              |  51 ++++
 src/test/regress/sql/internals.sql      |  11 +
 5 files changed, 390 insertions(+), 1 deletion(-)
 create mode 100644 src/include/utils/pg_stack_alloc.h
 create mode 100644 src/test/regress/expected/internals.out
 create mode 100644 src/test/regress/sql/internals.sql

diff --git a/src/include/utils/pg_stack_alloc.h b/src/include/utils/pg_stack_alloc.h
new file mode 100644
index 00000000000..bb8c58597b9
--- /dev/null
+++ b/src/include/utils/pg_stack_alloc.h
@@ -0,0 +1,315 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_stack_alloc.h
+ *		Allocator for objects that don't escape the current function.
+ *
+ * A palloc()-like interface for allocating memory on the stack.  The initial
+ * implementation uses an array declared statically.
+ *
+ * Once stack space is exhausted, allocations silently fall back to using
+ * palloc().  Memory should therefore still be freed explicitly with
+ * pg_stack_free() or MemoryContext-level cleanup.  It is a no-op in the
+ * common case that pfree() doesn't need to be called.
+ *
+ * XXX A space-limited version of alloca() could be added.
+ *
+ * XXX It might be possible to use something like "defer" or equivalent
+ * compiler extensions to clean up palloc()'d memory automatically, in future
+ * work, and then pg_stack_free() would not be necessary.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/pg_stack_alloc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_STACK_ALLOC_H
+#define PG_STACK_ALLOC_H
+
+#include "utils/elog.h"
+#include "utils/palloc.h"
+#include "miscadmin.h"
+
+#include <limits.h>
+#include <unistd.h>
+
+
+/* #define PG_STACK_USE_PALLOC_LOG "/tmp/pg_stack_alloc.csv" */
+
+/* Choose which implementation to use, if not already defined manually. */
+#if !defined(PG_STACK_USE_ARRAY) &&									\
+	!defined(PG_STACK_USE_PALLOC) &&								\
+	!defined(PG_STACK_USE_PALLOC_LOG)
+#define PG_STACK_USE_ARRAY
+#endif
+
+
+/*-------------------------------------------------------------------------
+ *
+ * Public API.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* This ensures that "align" is a constant and can't cause overflow. */
+#define PG_STACK_MAX_ALIGN 4096
+
+/* Declare a stack allocator with a default size limit. */
+#define DECLARE_PG_STACK()												\
+	DECLARE_PG_STACK_SIZE(128)
+
+/*
+ * As above, but with a caller-supplied limit on stack usage.  The default
+ * should be preferred.
+ */
+#define DECLARE_PG_STACK_SIZE(size) \
+	bool pg_stack_maybe_pfree pg_attribute_unused() = false;			\
+	size_t pg_stack_let_size;		/* temp, avoids double eval */		\
+	DECLARE_PG_STACK_IMPL(size)
+
+/* Allocate memory, optionally with explicit alignment. */
+#define pg_stack_alloc(size)											\
+	pg_stack_alloc_aligned((size), MAXIMUM_ALIGNOF)
+#define pg_stack_alloc_aligned(size, align)								\
+	(pg_stack_sanity_checks(align),										\
+	 pg_stack_let_size = (size),										\
+	 pg_stack_alloc_aligned_impl(pg_stack_let_size, (align)))
+
+/* As above, but also zero the memory. */
+#define pg_stack_alloc0(size) \
+	pg_stack_alloc0_aligned((size), MAXIMUM_ALIGNOF)
+#define pg_stack_alloc0_aligned(size, align)							\
+	(pg_stack_sanity_checks(align),										\
+	 pg_stack_let_size = (size),										\
+	 memset(pg_stack_alloc_aligned_impl(pg_stack_let_size, (align)),	\
+			0,															\
+			pg_stack_let_size))
+
+/* As above, but for a given type T. */
+#define pg_stack_alloc_object(T)										\
+	pg_stack_alloc_array(T, 1)
+#define pg_stack_alloc0_object(T)										\
+	pg_stack_alloc0_array(T, 1)
+
+/* As above, but for an array of objects of size T. */
+#define pg_stack_alloc_array(T, n)										\
+	(pg_stack_sanity_checks(alignof(T)), 								\
+	 StaticAssertExpr(sizeof(n) <= sizeof(size_t), "n too wide"), 		\
+	 pg_stack_let_size = pg_stack_T_mul_n(sizeof(T), sizeof(n), (n)),	\
+	 pg_stack_alloc_aligned_impl(pg_stack_let_size, alignof(T)))
+#define pg_stack_alloc0_array(T, n)										\
+	(pg_stack_sanity_checks(alignof(T)), 								\
+	 StaticAssertExpr(sizeof(n) <= sizeof(size_t), "n too wide"), 		\
+	 pg_stack_let_size = pg_stack_T_mul_n(sizeof(T), sizeof(n), (n)),	\
+	 pg_stack_alloc0_aligned(pg_stack_let_size, (alignof(T))))
+
+/* Copy a string. */
+#define pg_stack_strdup(cstr)											\
+	pg_stack_strdup_with_len((cstr), strlen(cstr))
+#define pg_stack_strndup(cstr, n)										\
+	pg_stack_strdup_with_len((cstr), strnlen((cstr), (n)))
+#define pg_stack_strdup_with_len(data, size)							\
+	(pg_stack_sanity_checks(1),											\
+	 pg_stack_let_size = (size),										\
+	 pg_stack_strdup_with_len_impl(										\
+		 pg_stack_alloc_aligned_impl(pg_stack_let_size + 1,				\
+									 alignof(char)),					\
+		 (data),														\
+		 pg_stack_let_size))
+#define pg_stack_text_to_cstring(text) \
+	pg_stack_strdup_with_len(VARDATA_ANY(text), VARSIZE_ANY_EXHDR(text))
+#define pg_stack_text_datum_to_cstring(datum)							\
+	pg_stack_text_to_cstring((text *) DatumGetPointer(datum))
+
+/*
+ * Free memory allocated with the above interfaces.  We don't expect to
+ * receive pointers allocated by palloc() directly and not via this API.  That
+ * would break the pg_stack_maybe_pfree optimization, and might limit
+ * future implementation techniques.
+ */
+#define pg_stack_free(ptr)												\
+	do																	\
+	{																	\
+		void *pg_stack_let_ptr = (ptr);									\
+		Assert(pg_stack_ptr_p(pg_stack_let_ptr) ||						\
+			   pg_stack_maybe_pfree);									\
+		if (unlikely(pg_stack_maybe_pfree) &&							\
+			!pg_stack_ptr_p(pg_stack_let_ptr))							\
+			pfree(pg_stack_let_ptr);									\
+	}																	\
+	while (0)
+
+
+/*-------------------------------------------------------------------------
+ *
+ * Private helper code common to all implementations.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Normal usage would have constant align values, so the first two checks
+ * should ideally be static assertions, but they are done this way to allow
+ * regress.c to loop through alignment sizes.
+ */
+#define pg_stack_sanity_checks(align)									\
+	(AssertMacro((align) > 0 && (align) <= PG_STACK_MAX_ALIGN),			\
+	 AssertMacro(((align) & ((align) - 1)) == 0 /* power-of-two? */))
+
+/* For assertions. */
+static inline bool
+pg_stack_ptr_is_aligned_p(const void *p, size_t align)
+{
+	return (uintptr_t) p % align == 0;
+}
+
+/* Maximum value that can be stored in an unsigned integer of given size. */
+static inline size_t
+pg_stack_max_for_uint_size(size_t size)
+{
+	Assert(size <= sizeof(size_t));
+	return SIZE_MAX >> ((sizeof(size_t) * CHAR_BIT) - size * CHAR_BIT);
+}
+
+/* Post-allocation part of pg_stack_strdup_with_len(). */
+static inline char *
+pg_stack_strdup_with_len_impl(char *dst, const char *data, size_t size)
+{
+	memcpy(dst, data, size);
+	dst[size] = 0;
+	return dst;
+}
+
+/* Compute sizeof(T) * n. */
+static inline size_t
+pg_stack_T_mul_n(size_t sizeof_T, size_t sizeof_n, size_t n)
+{
+	/* XXX Could overflow, which might allow the stack to be corrupted. */
+	return sizeof_T * n;
+}
+
+/*
+ * Fall back to palloc() or palloc_aligned() due to lack of space.  We waste a
+ * register remembering if we've ever had to do this, to generate better
+ * straight-line code for the case where we don't have to free anything.
+ */
+#define pg_stack_palloc_aligned(size, align)							\
+	((pg_stack_maybe_pfree = true),										\
+	 ((align) > MAXIMUM_ALIGNOF ?										\
+	  palloc_aligned((size), (align), 0) :								\
+	  palloc(size)))
+
+
+/*-------------------------------------------------------------------------
+ *
+ * Low-level implementations below this point supply the following macros:
+ *
+ * 1. DECLARE_PG_STACK_IMPL(size)
+ * 2. pg_stack_alloc_aligned(size, align)
+ * 3. pg_stack_ptr_p(ptr)
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*-------------------------------------------------------------------------
+ *
+ * Toy implementations for debugging.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* Just use palloc. */
+#ifdef PG_STACK_USE_PALLOC
+#define DECLARE_PG_STACK_IMPL(size)
+#define pg_stack_alloc_aligned_impl(size, align)						\
+	pg_stack_palloc_aligned((size), (align))
+#define pg_stack_ptr_p() false
+#endif
+
+/*
+ * Same, but log "location,function,size,depth" entries to a file, using the
+ * PG_STACK_USE_PALLOC_LOG macro's definition (a ""-quoted string literal) for
+ * the path.
+ */
+#ifdef PG_STACK_USE_PALLOC_LOG
+#define DECLARE_PG_STACK_IMPL(size)										\
+	FILE *pg_stack_log													\
+	__attribute__((cleanup(pg_stack_close_log))) =						\
+	fopen(PG_STACK_USE_PALLOC_LOG, "a+")
+#define pg_stack_alloc_aligned_impl(size, align)						\
+	(fprintf(pg_stack_log,												\
+			 "%s:%d,%s,%zu,%zu\n",										\
+			 __FILE__,													\
+			 __LINE__,													\
+			 __func__,													\
+			 (size_t) (size),											\
+			 ((const char *) stack_base_ptr -							\
+			  (const char *) __builtin_stack_address())),				\
+	 pg_stack_palloc_aligned((size), (align)))
+#define pg_stack_ptr_p(ptr) false
+static inline void
+pg_stack_close_log(FILE **f)
+{
+	fclose(*f);
+}
+#endif
+
+/*-------------------------------------------------------------------------
+ *
+ * Array-based implementation.
+ *
+ * This is entirely standard C requiring no compiler extensions, but it leaves
+ * a big hole in the stack when you call another function and has no ability
+ * to respect the total stack size limit.
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifdef PG_STACK_USE_ARRAY
+
+/* Required interface macros. */
+
+#define DECLARE_PG_STACK_IMPL(size)										\
+	char pg_stack_array[(size)];										\
+	char *pg_stack_p = pg_stack_array + (size)
+
+#define pg_stack_alloc_aligned_impl(size, align)						\
+	(pg_stack_alloc_aligned_from_array(&pg_stack_array[0],				\
+									   &pg_stack_p,						\
+									   (size),							\
+									   (align)) ?						\
+	 pg_stack_p :														\
+	 pg_stack_palloc_aligned(size, align))
+
+#define pg_stack_ptr_p(ptr)												\
+	((char *) (ptr) >= &pg_stack_array[0] &&							\
+	 (char *) (ptr) <= &pg_stack_array[sizeof(pg_stack_array)])
+
+
+/* Implementation. */
+
+static inline bool
+pg_stack_alloc_aligned_from_array(char *array,
+								  char **p,
+								  size_t size,
+								  size_t align)
+{
+	if (likely(size <= (uintptr_t) *p)) /* avoid overflow with huge size */
+	{
+		char	   *result = *p - size;
+
+		if (align > 1)
+			result = (char *) TYPEALIGN_DOWN(align, result);
+
+		if (likely(result >= array))
+		{
+			*p = result;
+			return true;
+		}
+	}
+	return false;
+}
+
+#endif
+
+#endif
diff --git a/src/test/regress/expected/internals.out b/src/test/regress/expected/internals.out
new file mode 100644
index 00000000000..532e9f791ed
--- /dev/null
+++ b/src/test/regress/expected/internals.out
@@ -0,0 +1,12 @@
+\getenv libdir PG_LIBDIR
+\getenv dlsuffix PG_DLSUFFIX
+\set regresslib :libdir '/regress' :dlsuffix
+CREATE FUNCTION test_pg_stack_alloc() RETURNS void
+    AS :'regresslib' LANGUAGE C STRICT;
+SELECT test_pg_stack_alloc();
+ test_pg_stack_alloc 
+---------------------
+ 
+(1 row)
+
+DROP FUNCTION test_pg_stack_alloc();
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 288e94dc408..37a97dbedc7 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -48,7 +48,7 @@ test: create_index create_index_spgist create_view index_including index_includi
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_aggregate create_function_sql create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse create_property_graph for_portion_of
+test: create_aggregate create_function_sql create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse create_property_graph for_portion_of internals
 
 # ----------
 # sanity_check does a vacuum, affecting the sort order of SELECT *
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 2bcb5559a45..74fe6fdce9d 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "commands/trigger.h"
+#include "common/int.h"
 #include "common/pg_lzcompress.h"
 #include "executor/executor.h"
 #include "executor/functions.h"
@@ -47,6 +48,7 @@
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/memutils.h"
+#include "utils/pg_stack_alloc.h"
 #include "utils/rel.h"
 #include "utils/typcache.h"
 
@@ -1488,3 +1490,52 @@ test_pglz_decompress(PG_FUNCTION_ARGS)
 	SET_VARSIZE(result, dlen + VARHDRSZ);
 	PG_RETURN_BYTEA_P(result);
 }
+
+PG_FUNCTION_INFO_V1(test_pg_stack_alloc);
+Datum
+test_pg_stack_alloc(PG_FUNCTION_ARGS)
+{
+	char	   *p;
+	char	   *p2 PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Need extra space to run under ASAN. */
+	DECLARE_PG_STACK_SIZE(1024 * 8);
+
+	/* Too big for the stack. */
+	p = pg_stack_alloc(10000);
+	Assert(!pg_stack_ptr_p(p));
+	pg_stack_free(p);
+
+	/* Acceptable size. */
+	p = pg_stack_alloc(10);
+	Assert(pg_stack_ptr_p(p));
+
+	/* Addresses should move downwards. */
+	p2 = pg_stack_alloc(10);
+	Assert(pg_stack_ptr_p(p));
+	Assert(p2 < p);
+
+	/* A zero-sized allocation is identifiable as a stack address. */
+	p = pg_stack_alloc(0);
+	Assert(pg_stack_ptr_p(p));
+
+	/* Test a range of alignments. */
+#define TEST_ALIGN MAXIMUM_ALIGNOF
+	for (int i = 1; i <= TEST_ALIGN * 8; i *= 2)
+	{
+		/* Allocate and check alignment is as requested, when we use palloc(). */
+		p = pg_stack_alloc_aligned(1024 * 8, i);
+
+		Assert(!pg_stack_ptr_p(p));
+		Assert(pg_stack_ptr_is_aligned_p(p, i));
+		pg_stack_free(p);
+
+		/* Allocate and check alignment is as requested. */
+		p = pg_stack_alloc_aligned(i, i);
+
+		Assert(pg_stack_ptr_p(p));
+		Assert(pg_stack_ptr_is_aligned_p(p, i));
+	}
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/regress/sql/internals.sql b/src/test/regress/sql/internals.sql
new file mode 100644
index 00000000000..bea61780f8b
--- /dev/null
+++ b/src/test/regress/sql/internals.sql
@@ -0,0 +1,11 @@
+\getenv libdir PG_LIBDIR
+\getenv dlsuffix PG_DLSUFFIX
+
+\set regresslib :libdir '/regress' :dlsuffix
+
+CREATE FUNCTION test_pg_stack_alloc() RETURNS void
+    AS :'regresslib' LANGUAGE C STRICT;
+
+SELECT test_pg_stack_alloc();
+
+DROP FUNCTION test_pg_stack_alloc();
-- 
2.47.3



  [text/x-patch] 0002-Provide-utf16-type.patch (94.3K, 3-0002-Provide-utf16-type.patch)
  download | inline diff:
From be07e2a0ad1a0da67a505a5184729b8c7dcd023d Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Tue, 17 Feb 2026 21:13:54 +1300
Subject: [PATCH 2/2] Provide utf16 type.

Science project...
---
 src/backend/utils/adt/pg_locale.c         | 103 +++++
 src/backend/utils/adt/pg_locale_icu.c     |  53 +++
 src/backend/utils/adt/varlena.c           | 468 +++++++++++++++++++-
 src/include/c.h                           |   5 +
 src/include/catalog/pg_amop.dat           |  79 ++++
 src/include/catalog/pg_amproc.dat         |  12 +
 src/include/catalog/pg_opclass.dat        |   2 +
 src/include/catalog/pg_operator.dat       | 160 +++++++
 src/include/catalog/pg_proc.dat           | 130 ++++++
 src/include/catalog/pg_type.dat           |   8 +
 src/include/fmgr.h                        |   3 +
 src/include/mb/pg_wchar.h                 |  34 ++
 src/include/mb/string_iterator.h          | 500 ++++++++++++++++++++++
 src/include/mb/unicode_strings.h          | 254 +++++++++++
 src/include/mb/unicode_types.h            |  61 +++
 src/include/utils/pg_locale.h             |  20 +
 src/test/regress/expected/encoding.out    |  32 ++
 src/test/regress/expected/type_sanity.out |   1 +
 src/test/regress/expected/unicode.out     | 134 ++++++
 src/test/regress/regress.c                |  16 +
 src/test/regress/sql/encoding.sql         |  16 +
 src/test/regress/sql/type_sanity.sql      |   1 +
 src/test/regress/sql/unicode.sql          | 101 +++++
 src/tools/pgindent/typedefs.list          |   4 +
 24 files changed, 2185 insertions(+), 12 deletions(-)
 create mode 100644 src/include/mb/string_iterator.h
 create mode 100644 src/include/mb/unicode_strings.h
 create mode 100644 src/include/mb/unicode_types.h

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6c5c1019e1e..9231d48ea3f 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -42,6 +42,7 @@
 #include "common/hashfn.h"
 #include "common/string.h"
 #include "mb/pg_wchar.h"
+#include "mb/unicode_strings.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/guc_hooks.h"
@@ -50,6 +51,7 @@
 #include "utils/pg_locale.h"
 #include "utils/pg_locale_c.h"
 #include "utils/relcache.h"
+#include "utils/pg_stack_alloc.h"
 #include "utils/syscache.h"
 
 #ifdef WIN32
@@ -1407,6 +1409,107 @@ pg_strncoll(const char *arg1, ssize_t len1, const char *arg2, ssize_t len2,
 	return locale->collate->strncoll(arg1, len1, arg2, len2, locale);
 }
 
+/*
+ * For providers without UTF-16 support, convert both strings to database
+ * encoding with NUL-terminator.
+ */
+static int
+pg_strncoll_char16_convert(const storage_char16_t *data1, size_t size1,
+						   const storage_char16_t *data2, size_t size2,
+						   pg_locale_t locale)
+{
+	char	   *cstr1;
+	char	   *cstr2;
+	int			result;
+
+	DECLARE_PG_STACK();
+
+	cstr1 = pg_stack_alloc(char16_to_mb_max_size(size1) + 1);
+	char16_to_local_cstr(cstr1, data1, size1);
+
+	cstr2 = pg_stack_alloc(char16_to_mb_max_size(size2) + 1);
+	char16_to_local_cstr(cstr2, data2, size2);
+
+	result = pg_strncoll(cstr1, -1, cstr2, -1, locale);
+
+	pg_stack_free(cstr1);
+	pg_stack_free(cstr2);
+
+	return result;
+}
+
+/*
+ * Collate two UTF-16 strings.
+ */
+int
+pg_strncoll_char16(const storage_char16_t *data1, size_t size1,
+				   const storage_char16_t *data2, size_t size2,
+				   pg_locale_t locale)
+{
+	if (locale->collate->strncoll_char16)
+		return locale->collate->strncoll_char16(data1, size1,
+												data2, size2,
+												locale);
+
+	return pg_strncoll_char16_convert(data1, size1, data2, size2, locale);
+}
+
+/*
+ * For providers without UTF-16 support, convert one string to database
+ * encoding with NUL-terminator.
+ */
+static int
+pg_strncoll_char16_local_convert(const storage_char16_t *data1, size_t size1,
+								 const char *data2, size_t size2,
+								 pg_locale_t locale)
+{
+	char	   *cstr1;
+	int			result;
+
+	DECLARE_PG_STACK();
+
+	cstr1 = pg_stack_alloc(char16_to_mb_max_size(size1) + 1);
+	char16_to_local_cstr(cstr1, data1, size1);
+
+	result = pg_strncoll(cstr1, -1, data2, size2, locale);
+
+	pg_stack_free(cstr1);
+
+	return result;
+}
+
+/*
+ * Compare a UTF-16 string and a database encoding string.
+ */
+int
+pg_strncoll_char16_local(const storage_char16_t *data1, size_t size1,
+						 const char *data2, size_t size2,
+						 pg_locale_t locale)
+{
+	if (locale->collate->strncoll_char16_local)
+		return locale->collate->strncoll_char16_local(data1, size2,
+													  data2, size2,
+													  locale);
+
+	return pg_strncoll_char16_local_convert(data1, size1, data2, size2,
+											locale);
+}
+
+/*
+ * Compare a UTF-16 string and a database encoding string.
+ */
+int
+pg_strncoll_local_char16(const char *data1, size_t size1,
+						 const storage_char16_t *data2, size_t size2,
+						 pg_locale_t locale)
+{
+	/* No seperate implementation for now. */
+	int			result = pg_strncoll_char16_local(data2, size2, data1, size1, locale);
+
+	INVERT_COMPARE_RESULT(result);
+	return result;
+}
+
 /*
  * Return true if the collation provider supports pg_strxfrm() and
  * pg_strnxfrm(); otherwise false.
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index a4a4e82eb9e..6bf617f8e56 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -77,6 +77,12 @@ static size_t downcase_ident_icu(char *dst, size_t dstsize, const char *src,
 static int	strncoll_icu(const char *arg1, ssize_t len1,
 						 const char *arg2, ssize_t len2,
 						 pg_locale_t locale);
+static int	strncoll_char16_icu(const storage_char16_t *arg1, size_t len1,
+								const storage_char16_t *arg2, size_t len2,
+								pg_locale_t locale);
+static int	strncoll_char16_utf8_icu(const storage_char16_t *arg1, size_t len1,
+									 const char *arg2, size_t len2,
+									 pg_locale_t locale);
 static size_t strnxfrm_icu(char *dest, size_t destsize,
 						   const char *src, ssize_t srclen,
 						   pg_locale_t locale);
@@ -165,6 +171,8 @@ static const struct collate_methods collate_methods_icu_utf8 = {
 #else
 	.strncoll = strncoll_icu,
 #endif
+	.strncoll_char16 = strncoll_char16_icu,
+	.strncoll_char16_local = strncoll_char16_utf8_icu,
 	.strnxfrm = strnxfrm_icu,
 	.strnxfrm_prefix = strnxfrm_prefix_icu_utf8,
 	.strxfrm_is_safe = true,
@@ -1062,6 +1070,51 @@ strncoll_icu(const char *arg1, ssize_t len1,
 	return result;
 }
 
+static int
+strncoll_char16_icu(const storage_char16_t *data1, size_t size1,
+					const storage_char16_t *data2, size_t size2,
+					pg_locale_t locale)
+{
+	UCharIterator iter1;
+	UCharIterator iter2;
+	UErrorCode	status;
+	int			result;
+
+	UITER_SET_STORAGE_CHAR_T(&iter1, data1, size1);
+	UITER_SET_STORAGE_CHAR_T(&iter2, data2, size2);
+
+	status = U_ZERO_ERROR;
+	result = ucol_strcollIter(locale->icu.ucol, &iter1, &iter2, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("collation failed: %s", u_errorName(status))));
+
+	return result;
+}
+
+static int
+strncoll_char16_utf8_icu(const storage_char16_t *data1, size_t size1,
+						 const char *data2, size_t size2,
+						 pg_locale_t locale)
+{
+	UCharIterator iter1;
+	UCharIterator iter2;
+	UErrorCode	status;
+	int			result;
+
+	/* Iterators for unaligned big-endian UTF-16 and UTF-8. */
+	UITER_SET_STORAGE_CHAR_T(&iter1, data1, size1);
+	uiter_setUTF8(&iter2, data2, size2);
+
+	status = U_ZERO_ERROR;
+	result = ucol_strcollIter(locale->icu.ucol, &iter1, &iter2, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("collation failed: %s", u_errorName(status))));
+
+	return result;
+}
+
 /* 'srclen' of -1 means the strings are NUL-terminated */
 static size_t
 strnxfrm_prefix_icu(char *dest, size_t destsize,
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c0ff51bd2fc..174aac26c48 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -30,6 +30,7 @@
 #include "funcapi.h"
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
+#include "mb/unicode_strings.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "parser/scansup.h"
@@ -40,6 +41,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_stack_alloc.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
 #include "utils/varlena.h"
@@ -4728,7 +4730,7 @@ text_reverse(PG_FUNCTION_ARGS)
 	text	   *str = PG_GETARG_TEXT_PP(0);
 	const char *p = VARDATA_ANY(str);
 	int			len = VARSIZE_ANY_EXHDR(str);
-	const char *endp = p + len;
+	mb_iterator iter = MB_ITERATOR_INIT_LOCAL(p, len);
 	text	   *result;
 	char	   *dst;
 
@@ -4739,22 +4741,16 @@ text_reverse(PG_FUNCTION_ARGS)
 	if (pg_database_encoding_max_length() > 1)
 	{
 		/* multibyte version */
-		while (p < endp)
-		{
-			int			sz;
-
-			sz = pg_mblen_range(p, endp);
-			dst -= sz;
-			memcpy(dst, p, sz);
-			p += sz;
-		}
+		while (mb_iterator_has_more(&iter))
+			dst -= mb_iterator_store_before(&iter, dst);
 	}
 	else
 	{
 		/* single byte version */
-		while (p < endp)
-			*(--dst) = *p++;
+		while (mb_iterator_has_more(&iter))
+			dst -= mb_iterator_store_before__sb(&iter, dst);
 	}
+	Assert(dst = VARDATA_ANY(result));
 
 	PG_RETURN_TEXT_P(result);
 }
@@ -5809,3 +5805,451 @@ invalid_pair:
 			 errmsg("invalid Unicode surrogate pair")));
 	PG_RETURN_NULL();			/* keep compiler quiet */
 }
+
+/*
+ * UTF-16 strings.  These provide more compact storage of some languages that
+ * would otherwise require 3-byte UTF8 seuences for their core character set.
+ * They are never directly exposed to clients without converted to database
+ * encoding..
+ */
+
+/*
+ * Allocate a new utf16 with space for up to max_utf16_size char16_t
+ * codepoints.  When converting from database encoding, multiply source bytes
+ * by MAX_UTF16_CODEPOINTS_PER_MBLEN.
+ */
+static inline utf16 *
+utf16_new(size_t max_utf16_size)
+{
+	utf16	   *result = palloc(VARHDRSZ + max_utf16_size * sizeof(char16_t));
+
+	SET_VARSIZE(result, VARHDRSZ + sizeof(char16_t) * max_utf16_size);
+	return result;
+}
+
+static inline storage_char16_t *
+utf16_data(utf16 *value)
+{
+	return (storage_char16_t *) VARDATA_ANY(value);
+}
+
+/*
+ * Return number of char16_t characters in a utf16.
+ */
+static inline size_t
+utf16_size(const utf16 *value)
+{
+	return VARSIZE_ANY_EXHDR(value) / sizeof(storage_char16_t);
+}
+
+/*
+ * Convenience macro for initializing char16_iterator directly from a utf16
+ * varlena object.
+ */
+#define CHAR16_ITERATOR_INIT_WITH_UTF16(o) \
+	CHAR16_ITERATOR_INIT(utf16_data(o), utf16_size(o))
+
+/*
+ * Set number of char16_t characters in a utf16, if it differs from the
+ * estimate given to utf16_new().  It can't be set larger.
+ */
+static inline void
+utf16_set_size(utf16 *value, size_t size)
+{
+	size_t		varsize = VARHDRSZ + sizeof(storage_char16_t) * size;
+
+	Assert(varsize <= VARSIZE(value));
+	SET_VARSIZE(value, varsize);
+}
+
+/*
+ * Construct a new utf16 from database encoding.
+ */
+static inline utf16 *
+utf16_new_from_local(const char *src, size_t src_size)
+{
+	utf16	   *result;
+	storage_char16_t *dst;
+	size_t		dst_size;
+
+	result = utf16_new(mb_to_char16_max_size(src_size));
+	dst = utf16_data(result);
+	dst_size = local_to_char16(dst, src, src_size);
+	utf16_set_size(result, dst_size);
+
+	return result;
+}
+
+Datum
+utf16_length(PG_FUNCTION_ARGS)
+{
+	utf16	   *u = PG_GETARG_UTF16_PP(0);
+	char16_iterator iter = CHAR16_ITERATOR_INIT_WITH_UTF16(u);
+	int32		result = 0;
+
+	while (char16_iterator_has_more(&iter))
+	{
+		result++;
+		char16_iterator_advance(&iter);
+	}
+
+	PG_RETURN_INT32(result);
+}
+
+Datum
+utf16_octet_length(PG_FUNCTION_ARGS)
+{
+	Datum		str = PG_GETARG_DATUM(0);
+
+	PG_RETURN_INT32((toast_raw_datum_size(str) - VARHDRSZ));
+}
+
+Datum
+utf16out(PG_FUNCTION_ARGS)
+{
+	utf16	   *value = PG_GETARG_UTF16_PP(0);
+	const storage_char16_t *src = utf16_data(value);
+	size_t		src_size = utf16_size(value);
+	char	   *dst_cstr;
+
+	dst_cstr = palloc(char16_to_mb_max_size(src_size) + 1);
+	char16_to_local_cstr(dst_cstr, src, src_size);
+
+	PG_RETURN_CSTRING(dst_cstr);
+}
+
+Datum
+utf16in(PG_FUNCTION_ARGS)
+{
+	const char *src_cstr = PG_GETARG_CSTRING(0);
+	size_t		src_size = strlen(src_cstr);
+	utf16	   *result;
+
+	result = utf16_new_from_local(src_cstr, src_size);
+
+	PG_RETURN_UTF16_P(result);
+}
+
+Datum
+utf16recv(PG_FUNCTION_ARGS)
+{
+	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
+	char	   *src;
+	int			src_size;
+	utf16	   *result;
+
+	src = pq_getmsgtext(buf, buf->len - buf->cursor, &src_size);
+	result = utf16_new_from_local(src, src_size);
+	pfree(src);
+
+	PG_RETURN_UTF16_P(result);
+}
+
+Datum
+utf16send(PG_FUNCTION_ARGS)
+{
+	utf16	   *value = PG_GETARG_UTF16_PP(0);
+	StringInfoData buf;
+	const storage_char16_t *src = utf16_data(value);
+	size_t		src_size = utf16_size(value);
+	char	   *dst;
+	size_t		dst_size;
+
+	DECLARE_PG_STACK();
+
+	dst = pg_stack_alloc(char16_to_mb_max_size(src_size));
+	dst_size = char16_to_local(dst, src, src_size);
+
+	pq_begintypsend(&buf);
+	pq_sendtext(&buf, dst, dst_size);
+
+	pg_stack_free(dst);
+
+	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+
+/* Codepoint order comparison. */
+static int
+char16_cmp(const storage_char16_t *data1, size_t size1,
+		   const storage_char16_t *data2, size_t size2)
+{
+	int			result = char16_cmp1(data1, data2, Min(size1, size2));
+
+	if (result == 0)
+		result = size1 < size2 ? -1 : size1 > size2 ? 1 : 0;
+	return result;
+}
+
+/*
+ * XXX It might be nice to use _Generic for type selection instead of funky
+ * name-pasting macros where you have to state the typenames explicitly, but
+ * utf16 and text are not distinct C types.
+ */
+
+/* Get an fmgr argument of type T. */
+#define GEN_GET_ARG__text								PG_GETARG_TEXT_PP
+#define GEN_GET_ARG__NameData 							PG_GETARG_NAME
+#define GEN_GET_ARG__utf16 								PG_GETARG_UTF16_PP
+#define GEN_GET_ARG(T, n) CppConcat2(GEN_GET_ARG__, T)(n)
+
+/* Get the character type for type T. */
+#define GEN_TYPEOF_CHAR__text							char
+#define GEN_TYPEOF_CHAR__NameData						char
+#define GEN_TYPEOF_CHAR__utf16							storage_char16_t
+#define GEN_TYPEOF_CHAR(T) CppConcat2(GEN_TYPEOF_CHAR__, T)
+
+/* Get a pointer to the basic characters from type T. */
+#define GEN_GET_DATA__text(v) 							VARDATA_ANY(v)
+#define GEN_GET_DATA__NameData(v)						NameStr(*(v))
+#define GEN_GET_DATA__utf16(v) 							utf16_data(v)
+#define GEN_GET_DATA(T, v) CppConcat2(GEN_GET_DATA__, T)(v)
+
+/* Get size in basic characters from T. */
+#define GEN_GET_SIZE__text(v) VARSIZE_ANY_EXHDR(v)
+#define GEN_GET_SIZE__NameData(v) strnlen(NameStr(*(v)), NAMEDATALEN)
+#define GEN_GET_SIZE__utf16(v) utf16_size(v)
+#define GEN_GET_SIZE(T, v) CppConcat2(GEN_GET_SIZE__, T)(v)
+
+/* Name mangling convention for pairs of character types. */
+#define GEN_FNAME__char__char							local
+#define GEN_FNAME__storage_char16_t__storage_char16_t	char16
+#define GEN_FNAME__char__storage_char16_t				local_char16
+#define GEN_FNAME__storage_char16_t__char				char16_local
+
+/* Name mangling convention for pairs of types. */
+#define GEN_FNAME__text__text							text
+#define GEN_FNAME__text__NameData						textname
+#define GEN_FNAME__text__utf16 							textutf16
+#define GEN_FNAME__NameData__text 						nametext
+#define GEN_FNAME__NameData__NameData					name
+#define GEN_FNAME__NameData__utf16						nameutf16
+#define GEN_FNAME__utf16__text							utf16text
+#define GEN_FNAME__utf16__NameData						utf16name
+#define GEN_FNAME__utf16__utf16							utf16
+
+/* Make function name for overload T1, T2 (prefix, suffix style). */
+#define GEN_FNAME_P(prefix, T1, T2)										\
+	CppConcat2(prefix, CppConcat2(GEN_FNAME__, T1##__##T2))
+#define GEN_FNAME_S(T1, T2, suffix)										\
+	CppConcat2(CppConcat2(GEN_FNAME__, T1##__##T2), suffix)
+
+/* Call function overloaded for T1, T2 (prefix, suffix style). */
+#define GEN_CALL_OVERLOAD_P(prefix, T1, T2, ...)						\
+	GEN_FNAME_P(prefix, T1, T2)(__VA_ARGS__)
+#define GEN_CALL_OVERLOAD_S(T1, T2, suffix, ...)						\
+	GEN_FNAME_S(T1, T2, suffix)(__VA_ARGS__)
+
+#define GEN_STRNCOLL(T1, T2)											\
+static int																\
+GEN_FNAME_S(T1, T2, _strncoll) (const T1 *data1, size_t size1,			\
+								const T2 *data2, size_t size2,			\
+								Oid collid)								\
+{																		\
+	pg_locale_t mylocale;												\
+	int			result;													\
+	check_collation_set(collid);										\
+	mylocale = pg_newlocale_from_collation(collid);						\
+	if (mylocale->collate_is_c)											\
+	{																	\
+		/* Codepoint order determines result. */						\
+		result = GEN_CALL_OVERLOAD_S(T1, T2, _cmp,						\
+									 data1, size1, data2, size2);		\
+	}																	\
+	else																\
+	{																	\
+		/* Locale determines results. */								\
+		result = GEN_CALL_OVERLOAD_P(pg_strncoll_, T1, T2,				\
+									 data1, size1, data2, size2,		\
+									 mylocale);							\
+		/* Codepoint order tie-breaker for derministic locales. */		\
+		if (result == 0 && mylocale->deterministic)						\
+			result = GEN_CALL_OVERLOAD_S(T1, T2, _cmp,					\
+										 data1, size1, data2, size2);	\
+	}																	\
+	return result;														\
+}
+
+/* Dispatch to correct overload of XXX_strncoll(). */
+#define GEN_CMP(T1, T2)													\
+Datum																	\
+GEN_FNAME_S(T1, T2, cmp)(PG_FUNCTION_ARGS)								\
+{																		\
+	T1 *arg1 = GEN_GET_ARG(T1, 0);										\
+	T2 *arg2 = GEN_GET_ARG(T2, 1);										\
+	Oid collid = PG_GET_COLLATION();									\
+	int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),				\
+									 GEN_TYPEOF_CHAR(T2),				\
+									 _strncoll,							\
+									 GEN_GET_DATA(T1, arg1),			\
+									 GEN_GET_SIZE(T1, arg1),			\
+									 GEN_GET_DATA(T2, arg2),			\
+									 GEN_GET_SIZE(T2, arg2),			\
+									 collid);							\
+	PG_FREE_IF_COPY(arg1, 0);											\
+	PG_FREE_IF_COPY(arg2, 1);											\
+	PG_RETURN_INT32(result);											\
+}
+
+/* General case: dispatch to correct overload of XXX_strncoll(). */
+#define GEN_REL(T1, T2, suffix, op)										\
+Datum																	\
+GEN_FNAME_S(T1, T2, suffix)(PG_FUNCTION_ARGS)							\
+{																		\
+	T1 *arg1 = GEN_GET_ARG(T1, 0);										\
+	T2 *arg2 = GEN_GET_ARG(T2, 1);										\
+	Oid collid = PG_GET_COLLATION();									\
+	int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),				\
+									 GEN_TYPEOF_CHAR(T2),				\
+									 _strncoll,							\
+									 GEN_GET_DATA(T1, arg1),			\
+									 GEN_GET_SIZE(T1, arg1),			\
+									 GEN_GET_DATA(T2, arg2),			\
+									 GEN_GET_SIZE(T2, arg2),			\
+									 collid);							\
+	PG_FREE_IF_COPY(arg1, 0);											\
+	PG_FREE_IF_COPY(arg2, 1);											\
+	PG_RETURN_BOOL(result op 0);										\
+}
+
+/*
+ * Special case for == and != and deterministic locales: bit level compare,
+ * skipping the locale system.  Otherwise, same as the above.
+ */
+#define GEN_EQ(T1, T2, suffix, op)										\
+Datum																	\
+GEN_FNAME_S(T1, T2, suffix)(PG_FUNCTION_ARGS)							\
+{																		\
+	Oid			collid = PG_GET_COLLATION();							\
+	pg_locale_t mylocale = 0;											\
+	static_assert(sizeof(GEN_TYPEOF_CHAR(T1)) !=						\
+				  sizeof(GEN_TYPEOF_CHAR(T2)),							\
+				  "should use optimized version for same char type");	\
+	check_collation_set(collid);										\
+	mylocale = pg_newlocale_from_collation(collid);						\
+	if (mylocale->deterministic)										\
+	{																	\
+		/* Codepoint order then length determine result. */				\
+		T1 *arg1 = GEN_GET_ARG(T1, 0);									\
+		T2 *arg2 = GEN_GET_ARG(T2, 1);									\
+		int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),			\
+										 GEN_TYPEOF_CHAR(T2),			\
+										 _cmp,							\
+										 GEN_GET_DATA(T1, arg1),		\
+										 GEN_GET_SIZE(T1, arg1),		\
+										 GEN_GET_DATA(T2, arg2),		\
+										 GEN_GET_SIZE(T2, arg2));		\
+		PG_FREE_IF_COPY(arg1, 0);										\
+		PG_FREE_IF_COPY(arg2, 1);										\
+		PG_RETURN_BOOL(result op 0);									\
+	}																	\
+	else																\
+	{																	\
+		/* Locale determines result. */									\
+		T1 *arg1 = GEN_GET_ARG(T1, 0);									\
+		T2 *arg2 = GEN_GET_ARG(T2, 1);									\
+		int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),			\
+										 GEN_TYPEOF_CHAR(T2),			\
+										 _strncoll,						\
+										 GEN_GET_DATA(T1, arg1),		\
+										 GEN_GET_SIZE(T1, arg1),		\
+										 GEN_GET_DATA(T2, arg2),		\
+										 GEN_GET_SIZE(T2, arg2),		\
+										 collid);						\
+		PG_FREE_IF_COPY(arg1, 0);										\
+		PG_FREE_IF_COPY(arg2, 1);										\
+		PG_RETURN_BOOL(result op 0);									\
+	}																	\
+}
+
+/*
+ * Special case for == and != and deterministic locales when the basic char
+ * type is the same: we can cheaply check the size first to determine that
+ * strings are NOT equal.
+ */
+#define GEN_EQ_SAME_CHAR(T1, T2, suffix, op)							\
+Datum																	\
+GEN_FNAME_S(T1, T2, suffix)(PG_FUNCTION_ARGS)							\
+{																		\
+	Oid			collid = PG_GET_COLLATION();							\
+	pg_locale_t mylocale = 0;											\
+	static_assert(sizeof(GEN_TYPEOF_CHAR(T1)) ==						\
+				  sizeof(GEN_TYPEOF_CHAR(T2)),							\
+				  "optimization requires same char");					\
+	check_collation_set(collid);										\
+	mylocale = pg_newlocale_from_collation(collid);						\
+	if (mylocale->deterministic)										\
+	{																	\
+		/* Optimization for same basic character type. */				\
+		if (toast_raw_datum_size(PG_GETARG_DATUM(0)) !=					\
+			toast_raw_datum_size(PG_GETARG_DATUM(1)))					\
+		{																\
+			/* Different size: fast result without detoasting. */		\
+			PG_RETURN_BOOL(false op true);								\
+		}																\
+		else															\
+		{																\
+			/* Same size: codepoint equality determines result. */		\
+			T1 *arg1 = PG_GETARG_UTF16_PP(0);							\
+			T2 *arg2 = PG_GETARG_UTF16_PP(1);							\
+			int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),		\
+											 GEN_TYPEOF_CHAR(T2),		\
+											 _cmp1,						\
+											 GEN_GET_DATA(T1, arg1),	\
+											 GEN_GET_DATA(T2, arg2),	\
+											 GEN_GET_SIZE(T1, arg1));	\
+			Assert(GEN_GET_SIZE(T1, arg1) == GEN_GET_SIZE(T2, arg2));	\
+			PG_FREE_IF_COPY(arg1, 0);									\
+			PG_FREE_IF_COPY(arg2, 1);									\
+			PG_RETURN_BOOL(result op 0);								\
+		}																\
+	}																	\
+	else																\
+	{																	\
+		/* Locale determines result. */									\
+		T1 *arg1 = GEN_GET_ARG(T1, 0);									\
+		T2 *arg2 = GEN_GET_ARG(T2, 1);									\
+		int result = GEN_CALL_OVERLOAD_S(GEN_TYPEOF_CHAR(T1),			\
+										 GEN_TYPEOF_CHAR(T2),			\
+										 _strncoll,						\
+										 GEN_GET_DATA(T1, arg1),		\
+										 GEN_GET_SIZE(T1, arg1),		\
+										 GEN_GET_DATA(T2, arg2),		\
+										 GEN_GET_SIZE(T2, arg2),		\
+										 collid);						\
+		PG_FREE_IF_COPY(arg1, 0);										\
+		PG_FREE_IF_COPY(arg2, 1);										\
+		PG_RETURN_BOOL(result op 0);									\
+	}																	\
+}
+
+/* Make the workhorse character-based collation functions. */
+GEN_STRNCOLL(storage_char16_t, storage_char16_t);
+GEN_STRNCOLL(storage_char16_t, char);
+GEN_STRNCOLL(char, storage_char16_t);
+
+/* Make the registered procedures with non-detoasting optimization. */
+#define GEN_RELS_SAME_CHAR(T1, T2)										\
+	GEN_CMP(T1, T2);													\
+	GEN_EQ_SAME_CHAR(T1, T2, eq, ==);									\
+	GEN_EQ_SAME_CHAR(T1, T2, ne, !=);									\
+	GEN_REL(T1, T2, lt, <);												\
+	GEN_REL(T1, T2, le, <=);											\
+	GEN_REL(T1, T2, ge, >=);											\
+	GEN_REL(T1, T2, gt, >);
+
+/* Make the registered precedures without non-detoasting optimization. */
+#define GEN_RELS_DIFF_CHAR(T1, T2)										\
+	GEN_CMP(T1, T2);													\
+	GEN_EQ(T1, T2, eq, ==);												\
+	GEN_EQ(T1, T2, ne, !=);												\
+	GEN_REL(T1, T2, lt, <);												\
+	GEN_REL(T1, T2, le, <=);											\
+	GEN_REL(T1, T2, ge, >=);											\
+	GEN_REL(T1, T2, gt, >);
+
+/* For now, generate only type permutations with utf16 on one side. */
+GEN_RELS_DIFF_CHAR(utf16, text);
+GEN_RELS_DIFF_CHAR(utf16, NameData);
+GEN_RELS_DIFF_CHAR(text, utf16);
+GEN_RELS_DIFF_CHAR(NameData, utf16);
+GEN_RELS_SAME_CHAR(utf16, utf16);
diff --git a/src/include/c.h b/src/include/c.h
index 97ed8c63f5e..07766516bde 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -497,6 +497,8 @@ extern "C++"
  *		Convert the argument to a string, after one round of macro expansion.
  * CppConcat
  *		Concatenate two arguments together, using the C preprocessor.
+ * CppConcat2
+ *		Concatenate two arguments together, and apply macro expansion.
  *
  * Note: There used to be support here for pre-ANSI C compilers that didn't
  * support # and ##.  Nowadays, these macros are just for clarity and/or
@@ -505,6 +507,7 @@ extern "C++"
 #define CppAsString(identifier) #identifier
 #define CppAsString2(x)			CppAsString(x)
 #define CppConcat(x, y)			x##y
+#define CppConcat2(x, y)		CppConcat(x, y)
 
 /*
  * VA_ARGS_NARGS
@@ -787,6 +790,7 @@ typedef struct varlena
  */
 typedef varlena bytea;
 typedef varlena text;
+typedef varlena utf16;
 typedef varlena BpChar;			/* blank-padded char, ie SQL char(n) */
 typedef varlena VarChar;		/* var-length char, ie SQL varchar(n) */
 
@@ -1508,6 +1512,7 @@ typedef uint32_t char32_t;
 #endif
 #endif
 
+
 /* IWYU pragma: end_exports */
 
 #endif							/* C_H */
diff --git a/src/include/catalog/pg_amop.dat b/src/include/catalog/pg_amop.dat
index 8d5a0004a47..febb5f71f30 100644
--- a/src/include/catalog/pg_amop.dat
+++ b/src/include/catalog/pg_amop.dat
@@ -397,6 +397,85 @@
 { amopfamily => 'btree/text_ops', amoplefttype => 'text',
   amoprighttype => 'name', amopstrategy => '5', amopopr => '>(text,name)',
   amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'utf16', amopstrategy => '1', amopopr => '<(utf16,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'utf16', amopstrategy => '2', amopopr => '<=(utf16,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'utf16', amopstrategy => '3', amopopr => '=(utf16,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'utf16', amopstrategy => '4', amopopr => '>=(utf16,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'utf16', amopstrategy => '5', amopopr => '>(utf16,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'text', amopstrategy => '1', amopopr => '<(utf16,text)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'text', amopstrategy => '2', amopopr => '<=(utf16,text)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'text', amopstrategy => '3', amopopr => '=(utf16,text)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'text', amopstrategy => '4', amopopr => '>=(utf16,text)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'text', amopstrategy => '5', amopopr => '>(utf16,text)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'text',
+  amoprighttype => 'utf16', amopstrategy => '1', amopopr => '<(text,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'text',
+  amoprighttype => 'utf16', amopstrategy => '2', amopopr => '<=(text,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'text',
+  amoprighttype => 'utf16', amopstrategy => '3', amopopr => '=(text,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'text',
+  amoprighttype => 'utf16', amopstrategy => '4', amopopr => '>=(text,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'text',
+  amoprighttype => 'utf16', amopstrategy => '5', amopopr => '>(text,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'name', amopstrategy => '1', amopopr => '<(utf16,name)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'name', amopstrategy => '2', amopopr => '<=(utf16,name)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'name', amopstrategy => '3', amopopr => '=(utf16,name)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'name', amopstrategy => '4', amopopr => '>=(utf16,name)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'utf16',
+  amoprighttype => 'name', amopstrategy => '5', amopopr => '>(utf16,name)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'name',
+  amoprighttype => 'utf16', amopstrategy => '1', amopopr => '<(name,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'name',
+  amoprighttype => 'utf16', amopstrategy => '2', amopopr => '<=(name,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'name',
+  amoprighttype => 'utf16', amopstrategy => '3', amopopr => '=(name,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'name',
+  amoprighttype => 'utf16', amopstrategy => '4', amopopr => '>=(name,utf16)',
+  amopmethod => 'btree' },
+{ amopfamily => 'btree/text_ops', amoplefttype => 'name',
+  amoprighttype => 'utf16', amopstrategy => '5', amopopr => '>(name,utf16)',
+  amopmethod => 'btree' },
+
+
+
+
 
 # btree bpchar_ops
 
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 4a1efdbc899..c78891b3f77 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -232,6 +232,18 @@
   amprocrighttype => 'text', amprocnum => '2', amproc => 'bttextsortsupport' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '4', amproc => 'btvarstrequalimage' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'utf16',
+  amprocrighttype => 'utf16', amprocnum => '1', amproc => 'utf16cmp' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'utf16',
+  amprocrighttype => 'utf16', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'utf16',
+  amprocrighttype => 'text', amprocnum => '1', amproc => 'utf16textcmp' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'text',
+  amprocrighttype => 'utf16', amprocnum => '1', amproc => 'textutf16cmp' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'utf16',
+  amprocrighttype => 'name', amprocnum => '1', amproc => 'utf16namecmp' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'name',
+  amprocrighttype => 'utf16', amprocnum => '1', amproc => 'nameutf16cmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'name', amprocnum => '1', amproc => 'btnamecmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..b3dfabb6956 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -127,6 +127,8 @@
   opcintype => 'text' },
 { opcmethod => 'hash', opcname => 'text_ops', opcfamily => 'hash/text_ops',
   opcintype => 'text' },
+{ opcmethod => 'btree', opcname => 'utf16_ops', opcfamily => 'btree/text_ops',
+  opcintype => 'utf16' },
 { opcmethod => 'btree', opcname => 'time_ops', opcfamily => 'btree/time_ops',
   opcintype => 'time' },
 { opcmethod => 'hash', opcname => 'time_ops', opcfamily => 'hash/time_ops',
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index 1465f13120a..f8f07d74e8b 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -107,6 +107,166 @@
   oprcode => 'starts_with', oprrest => 'prefixsel',
   oprjoin => 'prefixjoinsel' },
 
+# text_utf6
+{ oid => '9820', descr => 'equal',
+  oprname => '=', oprcanmerge => 't', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '=(utf16,utf16)',
+  oprnegate => '<>(utf16,utf16)', oprcode => 'utf16eq', oprrest => 'eqsel',
+  oprjoin => 'eqjoinsel' },
+{ oid => '9821', descr => 'not equal',
+  oprname => '<>', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<>(utf16,utf16)',
+  oprnegate => '=(utf16,utf16)', oprcode => 'utf16ne', oprrest => 'neqsel',
+  oprjoin => 'neqjoinsel' },
+{ oid => '9825', descr => 'less than',
+  oprname => '<', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>(utf16,utf16)',
+  oprnegate => '>=(utf16,utf16)', oprcode => 'utf16lt', oprrest => 'scalarltsel',
+  oprjoin => 'scalarltjoinsel' },
+{ oid => '9826', descr => 'less than or equal',
+  oprname => '<=', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>=(utf16,utf16)',
+  oprnegate => '>(utf16,utf16)', oprcode => 'utf16le', oprrest => 'scalarlesel',
+  oprjoin => 'scalarlejoinsel' },
+{ oid => '9827', descr => 'greater than',
+  oprname => '>', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<(utf16,utf16)',
+  oprnegate => '<=(utf16,utf16)', oprcode => 'utf16gt', oprrest => 'scalargtsel',
+  oprjoin => 'scalargtjoinsel' },
+{ oid => '9828', descr => 'greater than or equal',
+  oprname => '>=', oprleft => 'utf16',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<=(utf16,utf16)',
+  oprnegate => '<(utf16,utf16)', oprcode => 'utf16ge', oprrest => 'scalargesel',
+  oprjoin => 'scalargejoinsel' },
+
+# utf16, text
+{ oid => '9860', descr => 'equal',
+  oprname => '=', oprcanmerge => 't', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '=(text,utf16)',
+  oprnegate => '<>(utf16,text)', oprcode => 'utf16texteq', oprrest => 'eqsel',
+  oprjoin => 'eqjoinsel' },
+{ oid => '9861', descr => 'not equal',
+  oprname => '<>', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '<>(text,utf16)',
+  oprnegate => '=(utf16,text)', oprcode => 'utf16textne', oprrest => 'neqsel',
+  oprjoin => 'neqjoinsel' },
+{ oid => '9865', descr => 'less than',
+  oprname => '<', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '>(text,utf16)',
+  oprnegate => '>=(utf16,text)', oprcode => 'utf16textlt', oprrest => 'scalarltsel',
+  oprjoin => 'scalarltjoinsel' },
+{ oid => '9866', descr => 'less than or equal',
+  oprname => '<=', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '>=(text,utf16)',
+  oprnegate => '>(utf16,text)', oprcode => 'utf16textle', oprrest => 'scalarlesel',
+  oprjoin => 'scalarlejoinsel' },
+{ oid => '9867', descr => 'greater than',
+  oprname => '>', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '<(text,utf16)',
+  oprnegate => '<=(utf16,text)', oprcode => 'utf16textgt', oprrest => 'scalargtsel',
+  oprjoin => 'scalargtjoinsel' },
+{ oid => '9868', descr => 'greater than or equal',
+  oprname => '>=', oprleft => 'utf16',
+  oprright => 'text', oprresult => 'bool', oprcom => '<=(text,utf16)',
+  oprnegate => '<(utf16,text)', oprcode => 'utf16textge', oprrest => 'scalargesel',
+  oprjoin => 'scalargejoinsel' },
+
+# text, utf16
+{ oid => '9870', descr => 'equal',
+  oprname => '=', oprcanmerge => 't', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '=(utf16,text)',
+  oprnegate => '<>(text,utf16)', oprcode => 'textutf16eq', oprrest => 'eqsel',
+  oprjoin => 'eqjoinsel' },
+{ oid => '9871', descr => 'not equal',
+  oprname => '<>', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<>(utf16,text)',
+  oprnegate => '=(text,utf16)', oprcode => 'textutf16ne', oprrest => 'neqsel',
+  oprjoin => 'neqjoinsel' },
+{ oid => '9875', descr => 'less than',
+  oprname => '<', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>(utf16,text)',
+  oprnegate => '>=(text,utf16)', oprcode => 'textutf16lt', oprrest => 'scalarltsel',
+  oprjoin => 'scalarltjoinsel' },
+{ oid => '9876', descr => 'less than or equal',
+  oprname => '<=', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>=(utf16,text)',
+  oprnegate => '>(text,utf16)', oprcode => 'textutf16le', oprrest => 'scalarlesel',
+  oprjoin => 'scalarlejoinsel' },
+{ oid => '9877', descr => 'greater than',
+  oprname => '>', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<(utf16,text)',
+  oprnegate => '<=(text,utf16)', oprcode => 'textutf16gt', oprrest => 'scalargtsel',
+  oprjoin => 'scalargtjoinsel' },
+{ oid => '9878', descr => 'greater than or equal',
+  oprname => '>=', oprleft => 'text',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<=(utf16,text)',
+  oprnegate => '<(text,utf16)', oprcode => 'textutf16ge', oprrest => 'scalargesel',
+  oprjoin => 'scalargejoinsel' },
+
+# utf16, name
+{ oid => '9460', descr => 'equal',
+  oprname => '=', oprcanmerge => 't', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '=(name,utf16)',
+  oprnegate => '<>(utf16,name)', oprcode => 'utf16nameeq', oprrest => 'eqsel',
+  oprjoin => 'eqjoinsel' },
+{ oid => '9461', descr => 'not equal',
+  oprname => '<>', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '<>(name,utf16)',
+  oprnegate => '=(utf16,name)', oprcode => 'utf16namene', oprrest => 'neqsel',
+  oprjoin => 'neqjoinsel' },
+{ oid => '9465', descr => 'less than',
+  oprname => '<', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '>(name,utf16)',
+  oprnegate => '>=(utf16,name)', oprcode => 'utf16namelt', oprrest => 'scalarltsel',
+  oprjoin => 'scalarltjoinsel' },
+{ oid => '9466', descr => 'less than or equal',
+  oprname => '<=', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '>=(name,utf16)',
+  oprnegate => '>(utf16,name)', oprcode => 'utf16namele', oprrest => 'scalarlesel',
+  oprjoin => 'scalarlejoinsel' },
+{ oid => '9467', descr => 'greater than',
+  oprname => '>', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '<(name,utf16)',
+  oprnegate => '<=(utf16,name)', oprcode => 'utf16namegt', oprrest => 'scalargtsel',
+  oprjoin => 'scalargtjoinsel' },
+{ oid => '9468', descr => 'greater than or equal',
+  oprname => '>=', oprleft => 'utf16',
+  oprright => 'name', oprresult => 'bool', oprcom => '<=(name,utf16)',
+  oprnegate => '<(utf16,name)', oprcode => 'utf16namege', oprrest => 'scalargesel',
+  oprjoin => 'scalargejoinsel' },
+
+# name, utf16
+{ oid => '9470', descr => 'equal',
+  oprname => '=', oprcanmerge => 't', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '=(utf16,name)',
+  oprnegate => '<>(name,utf16)', oprcode => 'nameutf16eq', oprrest => 'eqsel',
+  oprjoin => 'eqjoinsel' },
+{ oid => '9471', descr => 'not equal',
+  oprname => '<>', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<>(utf16,name)',
+  oprnegate => '=(name,utf16)', oprcode => 'nameutf16ne', oprrest => 'neqsel',
+  oprjoin => 'neqjoinsel' },
+{ oid => '9475', descr => 'less than',
+  oprname => '<', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>(utf16,name)',
+  oprnegate => '>=(name,utf16)', oprcode => 'nameutf16lt', oprrest => 'scalarltsel',
+  oprjoin => 'scalarltjoinsel' },
+{ oid => '9476', descr => 'less than or equal',
+  oprname => '<=', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '>=(utf16,name)',
+  oprnegate => '>(name,utf16)', oprcode => 'nameutf16le', oprrest => 'scalarlesel',
+  oprjoin => 'scalarlejoinsel' },
+{ oid => '9477', descr => 'greater than',
+  oprname => '>', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<(utf16,name)',
+  oprnegate => '<=(name,utf16)', oprcode => 'nameutf16gt', oprrest => 'scalargtsel',
+  oprjoin => 'scalargtjoinsel' },
+{ oid => '9478', descr => 'greater than or equal',
+  oprname => '>=', oprleft => 'name',
+  oprright => 'utf16', oprresult => 'bool', oprcom => '<=(utf16,name)',
+  oprnegate => '<(name,utf16)', oprcode => 'nameutf16ge', oprrest => 'scalargesel',
+  oprjoin => 'scalargejoinsel' },
+
 { oid => '254', oid_symbol => 'NameEqualTextOperator', descr => 'equal',
   oprname => '=', oprcanmerge => 't', oprcanhash => 't', oprleft => 'name',
   oprright => 'text', oprresult => 'bool', oprcom => '=(text,name)',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fa9ae79082b..97019deca7e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -97,9 +97,17 @@
 { oid => '46', descr => 'I/O',
   proname => 'textin', prorettype => 'text', proargtypes => 'cstring',
   prosrc => 'textin' },
+{ oid => '8890', descr => 'I/O',
+  proname => 'utf16in', prorettype => 'utf16',
+  proargtypes => 'cstring',
+  prosrc => 'utf16in' },
 { oid => '47', descr => 'I/O',
   proname => 'textout', prorettype => 'cstring', proargtypes => 'text',
   prosrc => 'textout' },
+{ oid => '8891', descr => 'I/O',
+  proname => 'utf16out', prorettype => 'cstring',
+  proargtypes => 'utf16',
+  prosrc => 'utf16out' },
 { oid => '48', descr => 'I/O',
   proname => 'tidin', prorettype => 'tid', proargtypes => 'cstring',
   prosrc => 'tidin' },
@@ -264,6 +272,116 @@
   proname => 'version', provolatile => 's', prorettype => 'text',
   proargtypes => '', prosrc => 'pgsql_version' },
 
+{ oid => '8935', descr => 'less-equal-greater',
+  proname => 'utf16cmp', prorettype => 'int4',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16cmp' },
+{ oid => '8901',
+  proname => 'utf16le', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16le' },
+{ oid => '8902',
+  proname => 'utf16lt', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16lt' },
+{ oid => '8907',
+  proname => 'utf16gt', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16gt' },
+{ oid => '8904',
+  proname => 'utf16ge', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16ge' },
+{ oid => '8900',
+  proname => 'utf16eq', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16eq' },
+{ oid => '8905',
+  proname => 'utf16ne', prorettype => 'bool',
+  proargtypes => 'utf16 utf16', prosrc => 'utf16ne' },
+
+{ oid => '8972', descr => 'less-equal-greater',
+  proname => 'textutf16cmp', prorettype => 'int4',
+  proargtypes => 'text utf16', prosrc => 'textutf16cmp' },
+{ oid => '8941',
+  proname => 'textutf16le', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16le' },
+{ oid => '8942',
+  proname => 'textutf16lt', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16lt' },
+{ oid => '8947',
+  proname => 'textutf16gt', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16gt' },
+{ oid => '8944',
+  proname => 'textutf16ge', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16ge' },
+{ oid => '8970',
+  proname => 'textutf16eq', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16eq' },
+{ oid => '8971',
+  proname => 'textutf16ne', prorettype => 'bool',
+  proargtypes => 'text utf16', prosrc => 'textutf16ne' },
+
+{ oid => '8932', descr => 'less-equal-greater',
+  proname => 'utf16textcmp', prorettype => 'int4',
+  proargtypes => 'utf16 text', prosrc => 'utf16textcmp' },
+{ oid => '8951',
+  proname => 'utf16textle', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16textle' },
+{ oid => '8952',
+  proname => 'utf16textlt', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16textlt' },
+{ oid => '8957',
+  proname => 'utf16textgt', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16textgt' },
+{ oid => '8954',
+  proname => 'utf16textge', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16textge' },
+{ oid => '8980',
+  proname => 'utf16texteq', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16texteq' },
+{ oid => '8981',
+  proname => 'utf16textne', prorettype => 'bool',
+  proargtypes => 'utf16 text', prosrc => 'utf16textne' },
+
+{ oid => '8672', descr => 'less-equal-greater',
+  proname => 'nameutf16cmp', prorettype => 'int4',
+  proargtypes => 'name utf16', prosrc => 'nameutf16cmp' },
+{ oid => '8641',
+  proname => 'nameutf16le', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16le' },
+{ oid => '8642',
+  proname => 'nameutf16lt', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16lt' },
+{ oid => '8647',
+  proname => 'nameutf16gt', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16gt' },
+{ oid => '8644',
+  proname => 'nameutf16ge', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16ge' },
+{ oid => '8670',
+  proname => 'nameutf16eq', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16eq' },
+{ oid => '8671',
+  proname => 'nameutf16ne', prorettype => 'bool',
+  proargtypes => 'name utf16', prosrc => 'nameutf16ne' },
+
+{ oid => '8472', descr => 'less-equal-greater',
+  proname => 'utf16namecmp', prorettype => 'int4',
+  proargtypes => 'utf16 name', prosrc => 'utf16namecmp' },
+{ oid => '8451',
+  proname => 'utf16namele', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16namele' },
+{ oid => '8452',
+  proname => 'utf16namelt', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16namelt' },
+{ oid => '8457',
+  proname => 'utf16namegt', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16namegt' },
+{ oid => '8454',
+  proname => 'utf16namege', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16namege' },
+{ oid => '8470',
+  proname => 'utf16nameeq', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16nameeq' },
+{ oid => '8471',
+  proname => 'utf16namene', prorettype => 'bool',
+  proargtypes => 'utf16 name', prosrc => 'utf16namene' },
+
 { oid => '86', descr => 'I/O',
   proname => 'pg_ddl_command_in', prorettype => 'pg_ddl_command',
   proargtypes => 'cstring', prosrc => 'pg_ddl_command_in' },
@@ -2875,6 +2993,9 @@
 { oid => '1317', descr => 'length',
   proname => 'length', prorettype => 'int4', proargtypes => 'text',
   prosrc => 'textlen' },
+{ oid => '8861', descr => 'length',
+  proname => 'length', prorettype => 'int4', proargtypes => 'utf16',
+  prosrc => 'utf16_length' },
 { oid => '1318', descr => 'character length',
   proname => 'length', prorettype => 'int4', proargtypes => 'bpchar',
   prosrc => 'bpcharlen' },
@@ -2992,6 +3113,9 @@
 { oid => '1375', descr => 'octet length',
   proname => 'octet_length', prorettype => 'int4', proargtypes => 'bpchar',
   prosrc => 'bpcharoctetlen' },
+{ oid => '8860', descr => 'octet length',
+  proname => 'octet_length', prorettype => 'int4', proargtypes => 'utf16',
+  prosrc => 'utf16_octet_length' },
 
 { oid => '1377', descr => 'larger of two',
   proname => 'time_larger', prorettype => 'time', proargtypes => 'time time',
@@ -8303,9 +8427,15 @@
 { oid => '2414', descr => 'I/O',
   proname => 'textrecv', provolatile => 's', prorettype => 'text',
   proargtypes => 'internal', prosrc => 'textrecv' },
+{ oid => '8892', descr => 'I/O',
+  proname => 'utf16recv', provolatile => 's', prorettype => 'utf16',
+  proargtypes => 'internal', prosrc => 'utf16recv' },
 { oid => '2415', descr => 'I/O',
   proname => 'textsend', provolatile => 's', prorettype => 'bytea',
   proargtypes => 'text', prosrc => 'textsend' },
+{ oid => '8893', descr => 'I/O',
+  proname => 'utf16send', provolatile => 's', prorettype => 'bytea',
+  proargtypes => 'utf16', prosrc => 'utf16send' },
 { oid => '2416', descr => 'I/O',
   proname => 'unknownrecv', prorettype => 'unknown', proargtypes => 'internal',
   prosrc => 'unknownrecv' },
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index a1a753d1797..130dce0adf6 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -84,6 +84,14 @@
   typispreferred => 't', typinput => 'textin', typoutput => 'textout',
   typreceive => 'textrecv', typsend => 'textsend', typalign => 'i',
   typstorage => 'x', typcollation => 'default' },
+{ oid => '8888', array_type_oid => '8889',
+  descr => 'variable-length UTF-16 string',
+  typname => 'utf16', typlen => '-1', typbyval => 'f', typcategory => 'S',
+  typispreferred => 'f',
+  typinput => 'utf16in', typoutput => 'utf16out',
+  typreceive => 'utf16recv', typsend => 'utf16send',
+  typalign => 'i',
+  typstorage => 'x', typcollation => 'default' },
 { oid => '26', array_type_oid => '1028',
   descr => 'object identifier(oid), maximum 4 billion',
   typname => 'oid', typlen => '4', typbyval => 't', typcategory => 'N',
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 10d02bdb79f..75a86af4baa 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -291,6 +291,7 @@ extern varlena *pg_detoast_datum_packed(varlena *datum);
 /* DatumGetFoo macros for varlena types will typically look like this: */
 #define DatumGetByteaPP(X)			((bytea *) PG_DETOAST_DATUM_PACKED(X))
 #define DatumGetTextPP(X)			((text *) PG_DETOAST_DATUM_PACKED(X))
+#define DatumGetUtf16PP(X)			((utf16 *) PG_DETOAST_DATUM_PACKED(X))
 #define DatumGetBpCharPP(X)			((BpChar *) PG_DETOAST_DATUM_PACKED(X))
 #define DatumGetVarCharPP(X)		((VarChar *) PG_DETOAST_DATUM_PACKED(X))
 #define DatumGetHeapTupleHeader(X)	((HeapTupleHeader) PG_DETOAST_DATUM(X))
@@ -308,6 +309,7 @@ extern varlena *pg_detoast_datum_packed(varlena *datum);
 /* GETARG macros for varlena types will typically look like this: */
 #define PG_GETARG_BYTEA_PP(n)		DatumGetByteaPP(PG_GETARG_DATUM(n))
 #define PG_GETARG_TEXT_PP(n)		DatumGetTextPP(PG_GETARG_DATUM(n))
+#define PG_GETARG_UTF16_PP(n)		DatumGetUtf16PP(PG_GETARG_DATUM(n))
 #define PG_GETARG_BPCHAR_PP(n)		DatumGetBpCharPP(PG_GETARG_DATUM(n))
 #define PG_GETARG_VARCHAR_PP(n)		DatumGetVarCharPP(PG_GETARG_DATUM(n))
 #define PG_GETARG_HEAPTUPLEHEADER(n)	DatumGetHeapTupleHeader(PG_GETARG_DATUM(n))
@@ -372,6 +374,7 @@ extern varlena *pg_detoast_datum_packed(varlena *datum);
 /* RETURN macros for other pass-by-ref types will typically look like this: */
 #define PG_RETURN_BYTEA_P(x)   PG_RETURN_POINTER(x)
 #define PG_RETURN_TEXT_P(x)    PG_RETURN_POINTER(x)
+#define PG_RETURN_UTF16_P(x)   PG_RETURN_POINTER(x)
 #define PG_RETURN_BPCHAR_P(x)  PG_RETURN_POINTER(x)
 #define PG_RETURN_VARCHAR_P(x) PG_RETURN_POINTER(x)
 #define PG_RETURN_HEAPTUPLEHEADER(x)  return HeapTupleHeaderGetDatum(x)
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index deee2a832c3..f7e8179ad45 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -367,6 +367,7 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);
 
 /*
  * Some handy functions for Unicode-specific tests.
+ * XXX Move these to unicode.h?
  */
 static inline bool
 is_valid_unicode_codepoint(char32_t c)
@@ -392,6 +393,21 @@ surrogate_pair_to_codepoint(char16_t first, char16_t second)
 	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
 }
 
+static inline bool
+codepoint_has_surrogate_pair(char32_t c)
+{
+	return c >= 0x10000;
+}
+
+static inline void
+codepoint_to_surrogate_pair(char16_t *first, char16_t *second, char32_t c)
+{
+	Assert(codepoint_has_surrogate_pair(c));
+	c -= 0x10000;
+	*first = (c >> 10) + 0xD800;
+	*second = (c & 0x3FF) + 0xDC00;
+}
+
 /*
  * Convert a UTF-8 character to a Unicode code point.
  * This is a one-character version of pg_utf2wchar_with_len.
@@ -469,6 +485,24 @@ unicode_utf8len(char32_t c)
 		return 4;
 }
 
+/*
+ * Number of bytes in a UTF8 sequence, based on the first byte.
+ */
+static inline int
+utf8_len_from_lead_byte(unsigned char c)
+{
+	if ((c & 0x80) == 0)
+		return 1;
+	else if ((c & 0xe0) == 0xc0)
+		return 2;
+	else if ((c & 0xf0) == 0xe0)
+		return 3;
+	else if ((c & 0xf8) == 0xf0)
+		return 4;
+	else
+		return 1;
+}
+
 /*
  * The functions in this list are exported by libpq, and we need to be sure
  * that we know which calls are satisfied by libpq and which are satisfied
diff --git a/src/include/mb/string_iterator.h b/src/include/mb/string_iterator.h
new file mode 100644
index 00000000000..1a1bbce879f
--- /dev/null
+++ b/src/include/mb/string_iterator.h
@@ -0,0 +1,500 @@
+/*-------------------------------------------------------------------------
+ *
+ * string_iterator.h
+ *	  Tools for iterating over text strings as char, char16_t or char32_t.
+ *
+ * Support for UTF-16 and UTF-32 is degraded if the database encoding is not
+ * UTF8: only the 8-bit subset (LATIN1) or 7-bit subset of Unicode that can be
+ * cast directly is supported, and out-of-range codepoints raise errors.
+ *
+ * XXX Data provision via callbacks could be investigated as a way to support
+ * incremental or deferred detoasting with centralized infrastructure, to
+ * avoid the need to open-code detoasting optimizations at every site.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/mb/string_iterator.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STRING_ITERATOR_H
+#define STRING_ITERATOR_H
+
+#include "mb/pg_wchar.h"
+#include "mb/unicode_types.h"
+
+/* A device for iterating over multibyte strings. */
+typedef struct mb_iterator
+{
+	const char *p;
+	const char *end;
+	int			encoding;
+	char16_t	surrogate;
+} mb_iterator;
+
+/* An iterator for iterating over UTF-16 strings. */
+typedef struct char16_iterator
+{
+	const storage_char16_t *p;
+	const storage_char16_t *end;
+} char16_iterator;
+
+/* Static initializers for the above. */
+#define MB_ITERATOR_INIT(data, size, encoding) \
+	{(data), (data) + (size), (encoding)}
+#define MB_ITERATOR_INIT_LOCAL(data, size) \
+	{(data), (data) + (size), GetDatabaseEncoding()}
+#define CHAR16_ITERATOR_INIT(data, size) {(data), (data) + (size)}
+
+/* Function pointer types useful for creating specializations. */
+typedef char32_t (*mb_iterator_next_char32_t_fn) (mb_iterator *);
+typedef size_t (*mb_iterator_store_char16_fn) (mb_iterator *, storage_char16_t *);
+typedef char32_t (*char16_iterator_next_char32_t_fn) (char16_iterator *);
+typedef size_t (*char16_iterator_store_mb_fn) (char16_iterator *, char *, int encoding);
+
+static inline void
+mb_iterator_begin(mb_iterator *iterator, const char *data, size_t size)
+{
+	iterator->p = data;
+	iterator->end = data + size;
+	iterator->surrogate = 0;
+}
+
+static inline bool
+mb_iterator_has_more(mb_iterator *iterator)
+{
+	return iterator->p < iterator->end || iterator->surrogate;
+}
+
+static inline void
+mb_iterator_check_ascii_range(mb_iterator *iterator,
+							  const char *target_encoding,
+							  unsigned char c)
+{
+	if (unlikely(c > 0x7f))
+		elog(ERROR,
+			 "no conversion from \"%s\" to \"%s\" is available for the sequence beginning 0x%02x",
+			 pg_encoding_to_char(iterator->encoding),
+			 target_encoding,
+			 c);
+}
+
+/* Store a character at dst and return byte count. */
+static inline size_t
+mb_iterator_store(mb_iterator *iterator, char *dst)
+{
+	size_t		size;
+
+	Assert(mb_iterator_has_more(iterator));
+	size = pg_mblen_range(iterator->p, iterator->end);
+	memcpy(dst, iterator->p, size);
+	iterator->p += size;
+	return size;
+}
+
+static inline size_t
+mb_iterator_store__sb(mb_iterator *iterator, char *dst)
+{
+	Assert(pg_encoding_max_length(iterator->encoding) == 1);
+	Assert(mb_iterator_has_more(iterator));
+	*dst = *iterator->p++;
+	return 1;
+}
+
+/* Store a character in memory before dst and return byte count. */
+static inline size_t
+mb_iterator_store_before(mb_iterator *iterator, char *dst)
+{
+	size_t		size;
+
+	Assert(mb_iterator_has_more(iterator));
+	size = pg_mblen_range(iterator->p, iterator->end);
+	dst -= size;
+	memcpy(dst, iterator->p, size);
+	iterator->p += size;
+	return size;
+}
+
+static inline size_t
+mb_iterator_store_before__sb(mb_iterator *iterator, char *dst)
+{
+	Assert(pg_encoding_max_length(iterator->encoding) == 1);
+	Assert(mb_iterator_has_more(iterator));
+	dst--;
+	*dst = *iterator->p++;
+	return 1;
+}
+
+static inline char32_t
+mb_iterator_next_char32_t__ascii(mb_iterator *iterator)
+{
+	unsigned char c;
+
+	Assert(mb_iterator_has_more(iterator));
+
+	/* ASCII can be cast directly to char32_t, after 7-bit range check. */
+	c = *iterator->p++;
+	mb_iterator_check_ascii_range(iterator, "UTF-32", c);
+	return c;
+}
+
+static inline char32_t
+mb_iterator_next_char32_t__latin1(mb_iterator *iterator)
+{
+	Assert(iterator->encoding == PG_LATIN1);
+	Assert(mb_iterator_has_more(iterator));
+
+	/* LATIN1 (unsigned) can be cast directly to char32_t. */
+	return (unsigned char) *iterator->p++;
+}
+
+static inline char32_t
+mb_iterator_next_char32_t__utf8(mb_iterator *iterator)
+{
+	const char *p = iterator->p;
+	size_t		size;
+
+	Assert(iterator->encoding == PG_UTF8);
+	Assert(mb_iterator_has_more(iterator));
+
+	size = utf8_len_from_lead_byte(*p);
+	if (p + size > iterator->end)
+		report_invalid_encoding(iterator->encoding, p, iterator->end - p);
+	iterator->p += size;
+
+	return utf8_to_unicode((unsigned char *) p);
+}
+
+static inline char32_t
+mb_iterator_next_char32_t(mb_iterator *iterator)
+{
+	switch (iterator->encoding)
+	{
+		case PG_UTF8:
+			return mb_iterator_next_char32_t__utf8(iterator);
+		case PG_LATIN1:
+			return mb_iterator_next_char32_t__latin1(iterator);
+		default:
+			return mb_iterator_next_char32_t__ascii(iterator);
+	}
+}
+
+static inline char16_t
+mb_iterator_next_char16_t__ascii(mb_iterator *iterator)
+{
+	unsigned char c;
+
+	Assert(mb_iterator_has_more(iterator));
+
+	/* ASCII can be cast directly to char16_t after 7-bit range check. */
+	c = *iterator->p++;
+	mb_iterator_check_ascii_range(iterator, "UTF-16", c);
+	return *iterator->p++;
+}
+
+static inline char16_t
+mb_iterator_next_char16_t__latin1(mb_iterator *iterator)
+{
+	unsigned char c;
+
+	Assert(mb_iterator_has_more(iterator));
+	c = *iterator->p++;
+
+	/* LATIN1 (unsigned) can be cast directly to char16_t. */
+	return c;
+}
+
+static inline char16_t
+mb_iterator_next_char16_t__utf8(mb_iterator *iterator)
+{
+	char32_t	codepoint;
+
+	Assert(iterator->encoding == PG_UTF8);
+	Assert(mb_iterator_has_more(iterator));
+
+	if (unlikely(iterator->surrogate))
+	{
+		char16_t	result = iterator->surrogate;
+
+		iterator->surrogate = 0;
+		return result;
+	}
+
+	codepoint = mb_iterator_next_char32_t__utf8(iterator);
+	if (unlikely(codepoint_has_surrogate_pair(codepoint)))
+	{
+		char16_t	result;
+
+		codepoint_to_surrogate_pair(&result, &iterator->surrogate, codepoint);
+		return result;
+	}
+
+	return codepoint;
+}
+
+static inline char16_t
+mb_iterator_next_char16_t(mb_iterator *iterator)
+{
+	switch (iterator->encoding)
+	{
+		case PG_UTF8:
+			return mb_iterator_next_char16_t__utf8(iterator);
+		case PG_LATIN1:
+			return mb_iterator_next_char16_t__latin1(iterator);
+		default:
+			return mb_iterator_next_char16_t__ascii(iterator);
+	}
+}
+
+static inline size_t
+mb_iterator_store_char16__ascii(mb_iterator *iterator, storage_char16_t *dst)
+{
+	unsigned char c;
+
+	Assert(mb_iterator_has_more(iterator));
+
+	/* ASCII can be cast directly to char16_t after 7-bit range check. */
+	c = *iterator->p++;
+	mb_iterator_check_ascii_range(iterator, "UTF-16", c);
+	char16_store(dst, c);
+	return 1;
+}
+
+static inline size_t
+mb_iterator_store_char16__latin1(mb_iterator *iterator, storage_char16_t *dst)
+{
+	unsigned char c;
+
+	Assert(iterator->encoding == PG_LATIN1);
+	Assert(mb_iterator_has_more(iterator));
+
+	/* LATIN1 (unsigned) can be cast directly to char16_t. */
+	c = *iterator->p++;
+	char16_store(dst, c);
+	return 1;
+}
+
+static inline size_t
+mb_iterator_store_char16__utf8(mb_iterator *iterator, storage_char16_t *dst)
+{
+	char32_t	codepoint;
+
+	Assert(iterator->encoding == PG_UTF8);
+	Assert(mb_iterator_has_more(iterator));
+
+	codepoint = mb_iterator_next_char32_t__utf8(iterator);
+	if (unlikely(codepoint_has_surrogate_pair(codepoint)))
+	{
+		char16_t	codepoint1;
+		char16_t	codepoint2;
+
+		codepoint_to_surrogate_pair(&codepoint1, &codepoint2, codepoint);
+		char16_store(&dst[0], codepoint1);
+		char16_store(&dst[1], codepoint2);
+		return 2;
+	}
+
+	char16_store(dst, codepoint);
+	return 1;
+}
+
+static inline char16_t
+mb_iterator_store_char16(mb_iterator *iterator, storage_char16_t *dst)
+{
+	switch (iterator->encoding)
+	{
+		case PG_UTF8:
+			return mb_iterator_store_char16__utf8(iterator, dst);
+		case PG_LATIN1:
+			return mb_iterator_store_char16__latin1(iterator, dst);
+		default:
+			return mb_iterator_store_char16__ascii(iterator, dst);
+	}
+}
+
+static inline void
+char16_iterator_begin(char16_iterator *iterator,
+					  const storage_char16_t *data,
+					  size_t size)
+{
+	iterator->p = data;
+	iterator->end = data + size;
+}
+
+static inline bool
+char16_iterator_has_more(char16_iterator *iterator)
+{
+	return iterator->p < iterator->end;
+}
+
+static inline void
+char16_iterator_report_short_pair(char16_t codepoint1)
+{
+	elog(ERROR, "invalid UTF-16 sequence 0x%04x", codepoint1);
+}
+
+static inline void
+char16_iterator_report_bad_pair(char16_t codepoint1, char16_t codepoint2)
+{
+	elog(ERROR, "invalid UTF-16 sequence 0x%04x 0x%04x",
+		 codepoint1, codepoint2);
+}
+
+static inline char32_t
+char16_iterator_next_char32_t(char16_iterator *iterator)
+{
+	char32_t	codepoint;
+
+	Assert(char16_iterator_has_more(iterator));
+	codepoint = char16_load(iterator->p++);
+
+	if (unlikely(is_utf16_surrogate_first(codepoint)))
+	{
+		if (!char16_iterator_has_more(iterator))
+			char16_iterator_report_short_pair(codepoint);
+		codepoint = surrogate_pair_to_codepoint(codepoint,
+												char16_load(iterator->p++));
+	}
+
+	return codepoint;
+}
+
+static inline char16_t
+char16_iterator_next_char16_t(char16_iterator *iterator)
+{
+	Assert(char16_iterator_has_more(iterator));
+	return char16_load(iterator->p++);
+}
+
+static pg_attribute_always_inline size_t
+char16_iterator_store_mb__sb(char16_iterator *iterator, char *dst,
+							 unsigned char max_char, int encoding)
+{
+	char16_t	codepoint;
+
+	Assert(char16_iterator_has_more(iterator));
+
+	codepoint = char16_load(iterator->p++);
+	if (unlikely(codepoint > max_char))
+		elog(ERROR,
+			 "no conversion from \"UTF-16\" to \"%s\" is available for the codepoint %04x",
+			 pg_encoding_to_char(encoding),
+			 codepoint);
+	*dst = codepoint;
+
+	return 1;
+}
+
+static inline size_t
+char16_iterator_store_mb__ascii(char16_iterator *iterator, char *dst,
+								int encoding)
+{
+	/* Unicode can be cast to ASCII after 7-bit range check. */
+	return char16_iterator_store_mb__sb(iterator, dst, 0x7f, encoding);
+}
+
+static inline size_t
+char16_iterator_store_mb__latin1(char16_iterator *iterator, char *dst,
+								 int encoding)
+{
+	/* Unicode can be cast to LATIN1 after 8-bit range check. */
+	Assert(encoding == PG_LATIN1);
+	return char16_iterator_store_mb__sb(iterator, dst, 0xff, encoding);
+}
+
+static inline size_t
+char16_iterator_store_mb__utf8(char16_iterator *iterator, char *dst,
+							   int encoding)
+{
+	char32_t	codepoint;
+
+	Assert(encoding == PG_UTF8);
+	Assert(char16_iterator_has_more(iterator));
+
+	codepoint = char16_load(iterator->p++);
+
+	/* Start of a surrogate pair? */
+	if (unlikely(is_utf16_surrogate_first(codepoint)))
+	{
+		char16_t	codepoint2;
+
+		if (unlikely(!char16_iterator_has_more(iterator)))
+			char16_iterator_report_short_pair(codepoint);
+		codepoint2 = char16_load(iterator->p++);
+		if (unlikely(!is_utf16_surrogate_second(codepoint2)))
+			char16_iterator_report_bad_pair(codepoint, codepoint2);
+		codepoint = surrogate_pair_to_codepoint(codepoint, codepoint2);
+	}
+
+	unicode_to_utf8(codepoint, (unsigned char *) dst);
+	return unicode_utf8len(codepoint);
+}
+
+/*
+ * The destination must have space for MAX_MB_LEN_PER_UTF32_CODEPOINT bytes,
+ * because UTF-16 surrogate pairs are combined to UTF-32.
+ *
+ * Call one of the specializations directly to avoid dispatching overhead.
+ */
+static inline size_t
+char16_iterator_store_mb(char16_iterator *iterator, char *dst, int encoding)
+{
+	switch (encoding)
+	{
+		case PG_UTF8:
+			return char16_iterator_store_mb__utf8(iterator, dst, encoding);
+		case PG_LATIN1:
+			return char16_iterator_store_mb__latin1(iterator, dst, encoding);
+		default:
+			return char16_iterator_store_mb__ascii(iterator, dst, encoding);
+	}
+}
+
+/*
+ * char16_iterator_store_mb() for database encoding.
+ */
+static inline size_t
+char16_iterator_store_local(char16_iterator *iterator, char *dst)
+{
+	return char16_iterator_store_mb(iterator, dst, GetDatabaseEncoding());
+}
+
+/*
+ * Skip one UTF-32 codepoint.  The iterator must not be exhausted.
+ */
+static inline void
+char16_iterator_advance(char16_iterator *iterator)
+{
+	char16_t	codepoint;
+
+	Assert(char16_iterator_has_more(iterator));
+	codepoint = char16_load(iterator->p++);
+	if (unlikely(is_utf16_surrogate_first(codepoint)))
+	{
+		if (!char16_iterator_has_more(iterator))
+			char16_iterator_report_short_pair(codepoint);
+		iterator->p++;
+	}
+}
+
+/*
+ * Skip as many UTF-32 codepoints as possible, returning the number that were
+ * skipped before the string ended.
+ */
+static inline size_t
+char16_iterator_advance_n(char16_iterator *iterator, size_t n)
+{
+	size_t		distance = 0;
+
+	while (n > 0 && char16_iterator_has_more(iterator))
+	{
+		char16_iterator_advance(iterator);
+		distance++;
+		n--;
+	}
+
+	return distance;
+}
+
+#endif							/* STRING_ITERATOR_H */
diff --git a/src/include/mb/unicode_strings.h b/src/include/mb/unicode_strings.h
new file mode 100644
index 00000000000..84f6edbfa83
--- /dev/null
+++ b/src/include/mb/unicode_strings.h
@@ -0,0 +1,254 @@
+/*-------------------------------------------------------------------------
+ *
+ * unicode_strings.h
+ *	  Support functions for converting and comparing Unicode encodings.
+ *
+ * Limited support is available in all database encodings, but only the ASCII
+ * or LATIN1 range that maps directly to Unicode.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/mb/unicode_strings.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNICODE_STRINGS_H
+#define UNICODE_STRINGS_H
+
+#include "mb/string_iterator.h"
+#include "mb/unicode_types.h"
+
+/*
+ * Constants used for worst-case buffer management for conversions.  Single
+ * UTF-16 codepoints map to 1, 2 or 3-byte UTF-8 sequences (basic plane).
+ * UTF-16 surrogate pairs map to 4-byte UTF-8 sequences, but that works out to
+ * 2 bytes of UTF-8 per UTF-16 codepoint.
+ */
+#define MAX_CHAR16_CODEPOINTS_PER_UTF8_BYTE 1
+#define MAX_CHAR16_CODEPOINTS_PER_UTF8_CHAR 2
+#define MAX_UTF8_LEN_PER_CHAR16_CODEPOINT   3
+#define MAX_UTF8_LEN_PER_CHAR32_CODEPOINT   4
+
+/*
+ * Since we don't currently support transcodings other than ASCII and LATIN1
+ * (which are strict subsets of Unicode by definition), the corresponding
+ * values for other encodings are 1.  We still define and use these more
+ * general macro names, in anticipation of potential transcoding support.
+ */
+#define MAX_MB_LEN_PER_CHAR16_CODEPOINT   MAX_UTF8_LEN_PER_CHAR16_CODEPOINT
+#define MAX_MB_LEN_PER_CHAR32_CODEPOINT   MAX_UTF8_LEN_PER_CHAR32_CODEPOINT
+#define MAX_CHAR16_CODEPOINTS_PER_MB_BYTE MAX_CHAR16_CODEPOINTS_PER_UTF8_BYTE
+#define MAX_CHAR32_CODEPOINTS_PER_MB_BYTE MAX_CHAR32_CODEPOINTS_PER_UTF8_BYTE
+
+/*
+ * How many UTF-16 codepoints might the given database encoding string occupy?
+ * TODO: Also provide an _exact_size() function?
+ */
+static inline size_t
+mb_to_char16_max_size(size_t size)
+{
+	return size * MAX_CHAR16_CODEPOINTS_PER_MB_BYTE;
+}
+
+/*
+ * How many bytes of database encoding might the given UTF-16 string occupy?
+ * TODO: Also provide an _exact_size() function?
+ */
+static inline size_t
+char16_to_mb_max_size(size_t size)
+{
+	return size * MAX_MB_LEN_PER_CHAR16_CODEPOINT;
+}
+
+static pg_attribute_always_inline size_t
+mb_to_char16__template(storage_char16_t *dst,
+					   const char *src, size_t src_size, int src_encoding,
+					   mb_iterator_store_char16_fn store_char16)
+{
+	mb_iterator iter = MB_ITERATOR_INIT(src, src_size, src_encoding);
+	storage_char16_t *p = dst;
+
+	while (mb_iterator_has_more(&iter))
+		p += store_char16(&iter, p);
+
+	return p - dst;
+}
+
+#define GENERATE_MB_TO_CHAR16(encoding) \
+static inline size_t \
+mb_to_char16__##encoding(storage_char16_t *dst, \
+						const char *src, size_t src_size, int src_encoding) \
+{ \
+	return mb_to_char16__template(dst, src, src_size, src_encoding, \
+								 mb_iterator_store_char16__##encoding); \
+}
+GENERATE_MB_TO_CHAR16(utf8);
+GENERATE_MB_TO_CHAR16(latin1);
+GENERATE_MB_TO_CHAR16(ascii);
+
+static inline size_t
+mb_to_char16(storage_char16_t *dst,
+			 const char *src, size_t src_size, int src_encoding)
+{
+	switch (src_encoding)
+	{
+		case PG_UTF8:
+			return mb_to_char16__utf8(dst, src, src_size, src_encoding);
+		case PG_LATIN1:
+			return mb_to_char16__latin1(dst, src, src_size, src_encoding);
+		default:
+			return mb_to_char16__ascii(dst, src, src_size, src_encoding);
+	}
+}
+
+static inline size_t
+local_to_char16(storage_char16_t *dst, const char *src, size_t src_size)
+{
+	return mb_to_char16(dst, src, src_size, GetDatabaseEncoding());
+}
+
+static pg_attribute_always_inline size_t
+char16_to_mb__template(char *dst, int dst_encoding,
+					   const storage_char16_t *src, size_t src_size,
+					   char16_iterator_store_mb_fn store_mb)
+{
+	char16_iterator iter = CHAR16_ITERATOR_INIT(src, src_size);
+	char	   *p = dst;
+
+	while (char16_iterator_has_more(&iter))
+		p += store_mb(&iter, p, dst_encoding);
+
+	return p - dst;
+}
+
+/*
+ * Try to inline char16_iterator_store_mb__XXX specializations into
+ * char16_to_local__XXX specializations.
+ */
+#define GENERATE_CHAR16_TO_MB(encoding) \
+static inline size_t \
+char16_to_mb__##encoding(char *dst, int dst_encoding, \
+						const storage_char16_t *src, size_t src_size) \
+{ \
+	return char16_to_mb__template(dst, dst_encoding, src, src_size, \
+								 char16_iterator_store_mb__##encoding); \
+}
+GENERATE_CHAR16_TO_MB(ascii);
+GENERATE_CHAR16_TO_MB(latin1);
+GENERATE_CHAR16_TO_MB(utf8);
+
+static inline size_t
+char16_to_mb(char *dst, int dst_encoding,
+			 const storage_char16_t *src, size_t src_size)
+{
+	switch (dst_encoding)
+	{
+		case PG_UTF8:
+			return char16_to_mb__utf8(dst, dst_encoding, src, src_size);
+		case PG_LATIN1:
+			return char16_to_mb__latin1(dst, dst_encoding, src, src_size);
+		default:
+			return char16_to_mb__ascii(dst, dst_encoding, src, src_size);
+	}
+}
+
+static inline size_t
+char16_to_local(char *dst, const storage_char16_t *src, size_t src_size)
+{
+	return char16_to_mb(dst, GetDatabaseEncoding(), src, src_size);
+}
+
+static inline size_t
+char16_to_local_cstr(char *dst, const storage_char16_t *src, size_t src_size)
+{
+	size_t		size = char16_to_local(dst, src, src_size);
+
+	dst[size] = 0;
+	return size;
+}
+
+static pg_attribute_always_inline int
+char16_mb_cmp__template(const storage_char16_t *data1, size_t size1,
+						const char *data2, size_t size2, int encoding2,
+						mb_iterator_next_char32_t_fn next_char32_t)
+{
+	char16_iterator iter1 = CHAR16_ITERATOR_INIT(data1, size1);
+	mb_iterator iter2 = MB_ITERATOR_INIT(data2, size2, encoding2);
+
+	while (char16_iterator_has_more(&iter1) &&
+		   mb_iterator_has_more(&iter2))
+	{
+		char32_t	codepoint1 = char16_iterator_next_char32_t(&iter1);
+		char32_t	codepoint2 = next_char32_t(&iter2);
+
+		if (codepoint1 < codepoint2)
+			return -1;
+		else if (codepoint1 > codepoint2)
+			return 1;
+	}
+
+	if (mb_iterator_has_more(&iter2))
+		return -1;
+	else if (char16_iterator_has_more(&iter1))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Try to inline char16_iterator_store_mb__XXX specializations into
+ * char16_mb_cmp__XXX specializations.
+ */
+#define GENERATE_CHAR16_MB_CMP(encoding) \
+static inline size_t \
+char16_mb_cmp__##encoding(const storage_char16_t *data1, size_t size1, \
+						 const char *data2, size_t size2, int encoding2) \
+{ \
+	return char16_mb_cmp__template(data1, size1, data2, size2, encoding2, \
+								  mb_iterator_next_char32_t__##encoding); \
+}
+GENERATE_CHAR16_MB_CMP(ascii);
+GENERATE_CHAR16_MB_CMP(latin1);
+GENERATE_CHAR16_MB_CMP(utf8);
+
+static inline int
+char16_mb_cmp(const storage_char16_t *data1, size_t size1,
+			  const char *data2, size_t size2, int encoding2)
+{
+	switch (encoding2)
+	{
+		case PG_UTF8:
+			return char16_mb_cmp__utf8(data1, size1, data2, size2, encoding2);
+		case PG_LATIN1:
+			return char16_mb_cmp__latin1(data1, size1, data2, size2, encoding2);
+		default:
+			return char16_mb_cmp__ascii(data1, size1, data2, size2, encoding2);
+	}
+}
+
+static inline int
+mb_char16_cmp(const char *data1, size_t size1, int encoding1,
+			  const storage_char16_t *data2, size_t size2)
+{
+	int			result = char16_mb_cmp(data2, size2, data1, size1, encoding1);
+
+	INVERT_COMPARE_RESULT(result);
+	return result;
+}
+
+static inline int
+char16_local_cmp(const storage_char16_t *data1, size_t size1,
+				 const char *data2, size_t size2)
+{
+	return char16_mb_cmp(data1, size1, data2, size2, GetDatabaseEncoding());
+}
+
+static inline int
+local_char16_cmp(const char *data1, size_t size1,
+				 const storage_char16_t *data2, size_t size2)
+{
+	return mb_char16_cmp(data1, size1, GetDatabaseEncoding(), data2, size2);
+}
+
+#endif							/* UNICODE_STRINGS_H */
diff --git a/src/include/mb/unicode_types.h b/src/include/mb/unicode_types.h
new file mode 100644
index 00000000000..e6e85029266
--- /dev/null
+++ b/src/include/mb/unicode_types.h
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * unicode_types.h
+ *	  Types for representing Unicode.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/mb/unicode.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNICODE_TYPES_H
+#define UNICODE_TYPES_H
+
+/*
+ * The type used to represent UTF-16 codepoints in varlena objects.  All
+ * access to storage_char16_t and knowledge of its layout should be contained
+ * in this file.
+ */
+typedef struct storage_char16_t
+{
+	uint8_t		high;
+	uint8_t		low;
+} storage_char16_t;
+
+static_assert(alignof(storage_char16_t) == 1, "bad alignment for varlena");
+static_assert(sizeof(storage_char16_t) == sizeof(char16_t), "bad size");
+
+/* Read char16_t from storage_char16_t. */
+static inline char16_t
+char16_load(const storage_char16_t *s)
+{
+	return (s->high << 8) | s->low;
+}
+
+/* Write char16_t to storage_char16_t. */
+static inline void
+char16_store(storage_char16_t *s, char16_t c)
+{
+	s->high = c >> 8;
+	s->low = c;
+}
+
+/* Codepoint order of two storage_char16_t strings of equal size. */
+static inline int
+char16_cmp1(const storage_char16_t *s1,
+			const storage_char16_t *s2,
+			size_t size)
+{
+	/* Fast binary comparison, motivating big-endian representation. */
+	return memcmp(s1, s2, sizeof(storage_char16_t) * size);
+}
+
+/* Tell ICU's UCharIterator how to read from storage_char16_t format. */
+#define UITER_SET_STORAGE_CHAR_T(iterator, s, size) \
+	uiter_setUTF16BE((iterator), \
+					 (const char *) (s), \
+					 (size) * sizeof(storage_char16_t))
+
+#endif							/* UNICODE_TYPES_H */
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 444350bb803..e468e57e358 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -13,6 +13,7 @@
 #define _PG_LOCALE_
 
 #include "mb/pg_wchar.h"
+#include "mb/unicode_types.h"
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -67,6 +68,16 @@ struct collate_methods
 							 const char *arg2, ssize_t len2,
 							 pg_locale_t locale);
 
+	/* optional */
+	int			(*strncoll_char16) (const storage_char16_t *arg1, size_t len1,
+									const storage_char16_t *arg2, size_t len2,
+									pg_locale_t locale);
+
+	/* optional */
+	int			(*strncoll_char16_local) (const storage_char16_t *arg1, size_t len1,
+										  const char *arg2, size_t len2,
+										  pg_locale_t locale);
+
 	/* required */
 	size_t		(*strnxfrm) (char *dest, size_t destsize,
 							 const char *src, ssize_t srclen,
@@ -188,6 +199,15 @@ extern size_t pg_downcase_ident(char *dst, size_t dstsize,
 extern int	pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int	pg_strncoll(const char *arg1, ssize_t len1,
 						const char *arg2, ssize_t len2, pg_locale_t locale);
+extern int	pg_strncoll_char16(const storage_char16_t *data1, size_t size1,
+							   const storage_char16_t *data2, size_t size2,
+							   pg_locale_t locale);
+extern int	pg_strncoll_char16_local(const storage_char16_t *data1, size_t size1,
+									 const char *data2, size_t size2,
+									 pg_locale_t locale);
+extern int	pg_strncoll_local_char16(const char *data1, size_t size1,
+									 const storage_char16_t *data2, size_t size2,
+									 pg_locale_t locale);
 extern bool pg_strxfrm_enabled(pg_locale_t locale);
 extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
 						 pg_locale_t locale);
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index 2ecd255f182..4222f01c675 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -8,8 +8,12 @@ SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
 \set regresslib :libdir '/regress' :dlsuffix
 CREATE FUNCTION test_bytea_to_text(bytea) RETURNS text
     AS :'regresslib' LANGUAGE C STRICT;
+CREATE FUNCTION test_bytea_to_utf16(bytea) RETURNS utf16
+    AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_text_to_bytea(text) RETURNS bytea
     AS :'regresslib' LANGUAGE C STRICT;
+CREATE FUNCTION test_utf16_to_bytea(text) RETURNS bytea
+    AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_mblen_func(text, text, text, int) RETURNS int
     AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_text_to_wchars(text, text) RETURNS int[]
@@ -401,6 +405,32 @@ SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
  🚀
 (1 row)
 
+-- storage format of UTF-16 is big-endian even on little-endian system
+SELECT test_bytea_to_utf16('\x0041') = 'A';
+ ?column? 
+----------
+ t
+(1 row)
+
+-- odd number of bytes (corrupted storage), trailing byte ignored
+SELECT test_bytea_to_utf16('\x004100') = 'A';
+ ?column? 
+----------
+ f
+(1 row)
+
+SELECT test_bytea_to_utf16('\x004100');
+ test_bytea_to_utf16 
+---------------------
+ A
+(1 row)
+
+-- incomplete surrogate pair
+SELECT test_bytea_to_utf16('\xd83d');
+ERROR:  invalid UTF-16 sequence 0xd83d
+-- bad second codepoint in surrogate pair
+SELECT test_bytea_to_utf16('\xd83dbeef');
+ERROR:  invalid UTF-16 sequence 0xd83d 0xbeef
 DROP TABLE encoding_tests;
 DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
@@ -409,7 +439,9 @@ DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_valid_server_encoding;
 DROP FUNCTION test_mblen_func;
 DROP FUNCTION test_bytea_to_text;
+DROP FUNCTION test_bytea_to_utf16;
 DROP FUNCTION test_text_to_bytea;
+DROP FUNCTION test_utf16_to_bytea;
 -- substring slow path: multi-byte escape char vs. multi-byte pattern char.
 SELECT SUBSTRING('a' SIMILAR U&'\00AC' ESCAPE U&'\00A7');
  substring 
diff --git a/src/test/regress/expected/type_sanity.out b/src/test/regress/expected/type_sanity.out
index 1d21d3eb446..3391c1984a0 100644
--- a/src/test/regress/expected/type_sanity.out
+++ b/src/test/regress/expected/type_sanity.out
@@ -751,6 +751,7 @@ CREATE TABLE tab_core_types AS SELECT
   'abc'::varchar,
   'name'::name,
   'txt'::text,
+  'utf16'::utf16,
   true::bool,
   E'\\xDEADBEEF'::bytea,
   B'10001'::bit,
diff --git a/src/test/regress/expected/unicode.out b/src/test/regress/expected/unicode.out
index 1e06de22649..4b761eec6e9 100644
--- a/src/test/regress/expected/unicode.out
+++ b/src/test/regress/expected/unicode.out
@@ -105,3 +105,137 @@ ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
 ERROR:  invalid normalization form: def
+-- Interesting thresholds for UTF-8 and UTF-16 encoding
+WITH octet_length_thresholds(t, description) AS (VALUES
+  (U&'\+000001', 'First 1-byte UTF-8 sequence supported by PostgreSQL'),
+  (U&'\+00007F', 'Final 1-byte UTF-8 sequence'),
+  (U&'\+000080', 'First 2-byte UTF-8 sequence'),
+  (U&'\+0007FF', 'Final 2-byte UTF-8 sequence'),
+  (U&'\+000800', 'First 3-byte UTF-8 sequence'),
+  (U&'\+00FFFF', 'Final 3-byte UTF-8 sequence (end of BMP)'),
+  (U&'\+010000', 'First 4-byte UTF-8 sequence, UTF-16 pair'),
+  (U&'\+10FFFF', 'Final valid codepoint'))
+SELECT to_hex(ascii(t)),
+       description,
+       octet_length(t::text) AS utf8_octets,
+       octet_length(t::utf16) AS utf16_octets,
+       length(t::text) AS utf8_length,
+       length(t::utf16) AS utf16_length
+FROM octet_length_thresholds;
+ to_hex |                     description                     | utf8_octets | utf16_octets | utf8_length | utf16_length 
+--------+-----------------------------------------------------+-------------+--------------+-------------+--------------
+ 1      | First 1-byte UTF-8 sequence supported by PostgreSQL |           1 |            2 |           1 |            1
+ 7f     | Final 1-byte UTF-8 sequence                         |           1 |            2 |           1 |            1
+ 80     | First 2-byte UTF-8 sequence                         |           2 |            2 |           1 |            1
+ 7ff    | Final 2-byte UTF-8 sequence                         |           2 |            2 |           1 |            1
+ 800    | First 3-byte UTF-8 sequence                         |           3 |            2 |           1 |            1
+ ffff   | Final 3-byte UTF-8 sequence (end of BMP)            |           3 |            2 |           1 |            1
+ 10000  | First 4-byte UTF-8 sequence, UTF-16 pair            |           4 |            4 |           1 |            1
+ 10ffff | Final valid codepoint                               |           4 |            4 |           1 |            1
+(8 rows)
+
+-- Out of range codepoints
+SELECT U&'\+000000';
+ERROR:  invalid Unicode escape value
+LINE 1: SELECT U&'\+000000';
+                  ^
+SELECT U&'\+110000';
+ERROR:  invalid Unicode escape value
+LINE 1: SELECT U&'\+110000';
+                  ^
+CREATE FUNCTION check_text_op(left_string text,
+                              left_type text,
+                              right_string text,
+                              right_type text,
+                              op text)
+RETURNS boolean
+LANGUAGE plpgsql
+AS
+$$
+DECLARE
+  format text;
+  text_e text;
+  text_i int;
+  text_b boolean;
+  test_e text;
+  test_i int;
+  test_b boolean;
+BEGIN
+  -- all cross-type results against text, text
+  IF op = 'cmp' THEN
+    format := '%s%scmp(''%s''::%s, ''%s''::%s)';
+    text_e := format(format, 'bttext', '', left_string, 'text', right_string, 'text');
+    EXECUTE format('SELECT sign(%s)', text_e) INTO text_i;
+    test_e := format(format, left_type, CASE WHEN left_type = right_type THEN '' ELSE right_type END, left_string, left_type, right_string, right_type);
+    EXECUTE format('SELECT sign(%s)', test_e) INTO test_i;
+    IF test_i <> text_i THEN
+      RAISE NOTICE '% -> %, but % -> %', text_e, text_i, test_e, test_i;
+    END IF;
+  ELSE
+    format := '''%s''::%s %s ''%s''::%s';
+    text_e := format(format, left_string, 'text', op, right_string, 'text');
+    EXECUTE format('SELECT %s', text_e) INTO text_b;
+    test_e := format(format, left_string, left_type, op, right_string, right_type);
+    EXECUTE format('SELECT %s', test_e) INTO test_b;
+    IF test_b <> text_b THEN
+      RAISE NOTICE '% -> %, but % -> %', text_e, text_b, test_e, test_b;
+    END IF;
+  END IF;
+  RETURN true;
+END;
+$$;
+WITH strings (s) AS (VALUES ('a'), ('aa'), ('aaa')),
+     ops (o)     AS (VALUES ('<'), ('<='), ('='), ('<>'), ('>='), ('>'), ('cmp')),
+     types (t)   AS (VALUES ('text'), ('name'), ('utf16'))
+SELECT count(check_text_op(left_string.s,
+                           left_type.t,
+                           right_string.s,
+                           right_type.t,
+                           op.o))
+FROM       strings left_string
+CROSS JOIN strings right_string
+CROSS JOIN types   left_type
+CROSS JOIN types   right_type
+CROSS JOIN ops     op
+WHERE left_type.t = 'utf16' OR right_type.t = 'utf16';
+ count 
+-------
+   315
+(1 row)
+
+WITH examples(language, string) AS
+(VALUES
+  ('English',  'In a hole in the ground there lived a hobbit.'),
+  ('Spanish',  'En un agujero en el suelo, vivía un hobbit.'),
+  ('Russian',  'В норе под землей жил-был хоббит.'),
+  ('Arabic',   'كان يعيش هوبيت في حفرة في الأرض.'),
+  ('Hebrew',   'בתוך חור באדמה חי הוביט.'),
+  ('Greek',    'Σε μια τρύπα στο έδαφος ζούσε ένα χόμπιτ.'),
+  ('Korean',   '땅속 어느 구멍에 한 호빗이 살고 있었다.'),
+  ('Hindi',    'जमीन में बने एक गड्ढे में एक हॉबिट रहता था।'),
+  ('Tamil',    'அந்த நிலத்தில் ஒரு துளையில் ஒரு ஹாபிட் வசித்து வந்தது.'),
+  ('Chinese',  '在地下一个洞里，住着一个霍比特人。'),
+  ('Japanese', '穴のなかに、ひとりのホビットが暮らしていた。'))
+SELECT language,
+       octet_length(string::text) || '→' || octet_length(string::utf16) AS octets,
+       to_char(ROUND(100 *
+                     ((octet_length(string::utf16)::float /
+                      (octet_length(string::text)::float) - 1.0))),
+               'S999%') AS delta,
+       string
+FROM examples;
+ language | octets  | delta |                     string                      
+----------+---------+-------+-------------------------------------------------
+ English  | 45→90   | +100% | In a hole in the ground there lived a hobbit.
+ Spanish  | 44→86   |  +95% | En un agujero en el suelo, vivía un hobbit.
+ Russian  | 59→66   |  +12% | В норе под землей жил-был хоббит.
+ Arabic   | 57→64   |  +12% | كان يعيش هوبيت في حفرة في الأرض.
+ Hebrew   | 43→48   |  +12% | בתוך חור באדמה חי הוביט.
+ Greek    | 74→82   |  +11% | Σε μια τρύπα στο έδαφος ζούσε ένα χόμπιτ.
+ Korean   | 55→46   |  -16% | 땅속 어느 구멍에 한 호빗이 살고 있었다.
+ Hindi    | 111→86  |  -23% | जमीन में बने एक गड्ढे में एक हॉबिट रहता था।
+ Tamil    | 146→108 |  -26% | அந்த நிலத்தில் ஒரு துளையில் ஒரு ஹாபிட் வசித்து வந்தது.
+ Chinese  | 51→34   |  -33% | 在地下一个洞里，住着一个霍比特人。
+ Japanese | 66→44   |  -33% | 穴のなかに、ひとりのホビットが暮らしていた。
+(11 rows)
+
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 74fe6fdce9d..617c719a804 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -1144,6 +1144,22 @@ test_text_to_bytea(PG_FUNCTION_ARGS)
 	PG_RETURN_BYTEA_P(PG_GETARG_TEXT_PP(0));
 }
 
+/* Convert bytea to utf16 without validation. */
+PG_FUNCTION_INFO_V1(test_bytea_to_utf16);
+Datum
+test_bytea_to_utf16(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_UTF16_P(PG_GETARG_BYTEA_PP(0));
+}
+
+/* And the reverse. */
+PG_FUNCTION_INFO_V1(test_utf16_to_bytea);
+Datum
+test_utf16_to_bytea(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BYTEA_P(PG_GETARG_UTF16_PP(0));
+}
+
 /* Corruption tests in C. */
 PG_FUNCTION_INFO_V1(test_mblen_func);
 Datum
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index 07d7dc8ff18..7a0e386f5fb 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -11,8 +11,12 @@ SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
 
 CREATE FUNCTION test_bytea_to_text(bytea) RETURNS text
     AS :'regresslib' LANGUAGE C STRICT;
+CREATE FUNCTION test_bytea_to_utf16(bytea) RETURNS utf16
+    AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_text_to_bytea(text) RETURNS bytea
     AS :'regresslib' LANGUAGE C STRICT;
+CREATE FUNCTION test_utf16_to_bytea(text) RETURNS bytea
+    AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_mblen_func(text, text, text, int) RETURNS int
     AS :'regresslib' LANGUAGE C STRICT;
 CREATE FUNCTION test_text_to_wchars(text, text) RETURNS int[]
@@ -219,6 +223,16 @@ ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
 UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
 SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
 
+-- storage format of UTF-16 is big-endian even on little-endian system
+SELECT test_bytea_to_utf16('\x0041') = 'A';
+-- odd number of bytes (corrupted storage), trailing byte ignored
+SELECT test_bytea_to_utf16('\x004100') = 'A';
+SELECT test_bytea_to_utf16('\x004100');
+-- incomplete surrogate pair
+SELECT test_bytea_to_utf16('\xd83d');
+-- bad second codepoint in surrogate pair
+SELECT test_bytea_to_utf16('\xd83dbeef');
+
 DROP TABLE encoding_tests;
 DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
@@ -227,7 +241,9 @@ DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_valid_server_encoding;
 DROP FUNCTION test_mblen_func;
 DROP FUNCTION test_bytea_to_text;
+DROP FUNCTION test_bytea_to_utf16;
 DROP FUNCTION test_text_to_bytea;
+DROP FUNCTION test_utf16_to_bytea;
 
 
 -- substring slow path: multi-byte escape char vs. multi-byte pattern char.
diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql
index 95d5b6e0915..c14f0a3c82f 100644
--- a/src/test/regress/sql/type_sanity.sql
+++ b/src/test/regress/sql/type_sanity.sql
@@ -567,6 +567,7 @@ CREATE TABLE tab_core_types AS SELECT
   'abc'::varchar,
   'name'::name,
   'txt'::text,
+  'utf16'::utf16,
   true::bool,
   E'\\xDEADBEEF'::bytea,
   B'10001'::bit,
diff --git a/src/test/regress/sql/unicode.sql b/src/test/regress/sql/unicode.sql
index e50adb68ed0..0eeb86801b6 100644
--- a/src/test/regress/sql/unicode.sql
+++ b/src/test/regress/sql/unicode.sql
@@ -36,3 +36,104 @@ FROM
 ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
+
+-- Interesting thresholds for UTF-8 and UTF-16 encoding
+
+WITH octet_length_thresholds(t, description) AS (VALUES
+  (U&'\+000001', 'First 1-byte UTF-8 sequence supported by PostgreSQL'),
+  (U&'\+00007F', 'Final 1-byte UTF-8 sequence'),
+  (U&'\+000080', 'First 2-byte UTF-8 sequence'),
+  (U&'\+0007FF', 'Final 2-byte UTF-8 sequence'),
+  (U&'\+000800', 'First 3-byte UTF-8 sequence'),
+  (U&'\+00FFFF', 'Final 3-byte UTF-8 sequence (end of BMP)'),
+  (U&'\+010000', 'First 4-byte UTF-8 sequence, UTF-16 pair'),
+  (U&'\+10FFFF', 'Final valid codepoint'))
+SELECT to_hex(ascii(t)),
+       description,
+       octet_length(t::text) AS utf8_octets,
+       octet_length(t::utf16) AS utf16_octets,
+       length(t::text) AS utf8_length,
+       length(t::utf16) AS utf16_length
+FROM octet_length_thresholds;
+-- Out of range codepoints
+SELECT U&'\+000000';
+SELECT U&'\+110000';
+
+CREATE FUNCTION check_text_op(left_string text,
+                              left_type text,
+                              right_string text,
+                              right_type text,
+                              op text)
+RETURNS boolean
+LANGUAGE plpgsql
+AS
+$$
+DECLARE
+  format text;
+  text_e text;
+  text_i int;
+  text_b boolean;
+  test_e text;
+  test_i int;
+  test_b boolean;
+BEGIN
+  -- all cross-type results against text, text
+  IF op = 'cmp' THEN
+    format := '%s%scmp(''%s''::%s, ''%s''::%s)';
+    text_e := format(format, 'bttext', '', left_string, 'text', right_string, 'text');
+    EXECUTE format('SELECT sign(%s)', text_e) INTO text_i;
+    test_e := format(format, left_type, CASE WHEN left_type = right_type THEN '' ELSE right_type END, left_string, left_type, right_string, right_type);
+    EXECUTE format('SELECT sign(%s)', test_e) INTO test_i;
+    IF test_i <> text_i THEN
+      RAISE NOTICE '% -> %, but % -> %', text_e, text_i, test_e, test_i;
+    END IF;
+  ELSE
+    format := '''%s''::%s %s ''%s''::%s';
+    text_e := format(format, left_string, 'text', op, right_string, 'text');
+    EXECUTE format('SELECT %s', text_e) INTO text_b;
+    test_e := format(format, left_string, left_type, op, right_string, right_type);
+    EXECUTE format('SELECT %s', test_e) INTO test_b;
+    IF test_b <> text_b THEN
+      RAISE NOTICE '% -> %, but % -> %', text_e, text_b, test_e, test_b;
+    END IF;
+  END IF;
+  RETURN true;
+END;
+$$;
+
+WITH strings (s) AS (VALUES ('a'), ('aa'), ('aaa')),
+     ops (o)     AS (VALUES ('<'), ('<='), ('='), ('<>'), ('>='), ('>'), ('cmp')),
+     types (t)   AS (VALUES ('text'), ('name'), ('utf16'))
+SELECT count(check_text_op(left_string.s,
+                           left_type.t,
+                           right_string.s,
+                           right_type.t,
+                           op.o))
+FROM       strings left_string
+CROSS JOIN strings right_string
+CROSS JOIN types   left_type
+CROSS JOIN types   right_type
+CROSS JOIN ops     op
+WHERE left_type.t = 'utf16' OR right_type.t = 'utf16';
+
+WITH examples(language, string) AS
+(VALUES
+  ('English',  'In a hole in the ground there lived a hobbit.'),
+  ('Spanish',  'En un agujero en el suelo, vivía un hobbit.'),
+  ('Russian',  'В норе под землей жил-был хоббит.'),
+  ('Arabic',   'كان يعيش هوبيت في حفرة في الأرض.'),
+  ('Hebrew',   'בתוך חור באדמה חי הוביט.'),
+  ('Greek',    'Σε μια τρύπα στο έδαφος ζούσε ένα χόμπιτ.'),
+  ('Korean',   '땅속 어느 구멍에 한 호빗이 살고 있었다.'),
+  ('Hindi',    'जमीन में बने एक गड्ढे में एक हॉबिट रहता था।'),
+  ('Tamil',    'அந்த நிலத்தில் ஒரு துளையில் ஒரு ஹாபிட் வசித்து வந்தது.'),
+  ('Chinese',  '在地下一个洞里，住着一个霍比特人。'),
+  ('Japanese', '穴のなかに、ひとりのホビットが暮らしていた。'))
+SELECT language,
+       octet_length(string::text) || '→' || octet_length(string::utf16) AS octets,
+       to_char(ROUND(100 *
+                     ((octet_length(string::utf16)::float /
+                      (octet_length(string::text)::float) - 1.0))),
+               'S999%') AS delta,
+       string
+FROM examples;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 49dfb662abc..8ff10a36ee6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3649,6 +3649,7 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+char16_iterator
 char16_t
 char32_t
 check_agg_arguments_context
@@ -3946,6 +3947,7 @@ mbcharacter_incrementer
 mbdisplaylen_converter
 mblen_converter
 mbstr_verifier
+mb_iterator
 memoize_hash
 memoize_iterator
 metastring
@@ -4313,6 +4315,7 @@ ssize_t
 standard_qp_extra
 stemmer_module
 stmtCacheEntry
+storage_char16_t
 storeInfo
 storeRes_func
 stream_stop_callback
@@ -4388,6 +4391,7 @@ unicodeStyleRowFormat
 unicode_linestyle
 unit_conversion
 unlogged_relation_entry
+utf16
 utf_local_conversion_func
 uuidKEY
 uuid_rc_t
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Experimenting with wider Unicode storage
@ 2026-04-16 01:23  Henson Choi <[email protected]>
  parent: Thomas Munro <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Henson Choi @ 2026-04-16 01:23 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Tatsuo Ishii <[email protected]>

Hi Thomas,

Thank you for sharing this very interesting and creative approach.
Encoding is indeed a crucial factor in capacity planning and
performance benchmarking — I find this direction quite compelling.

I'm currently working on a few other things, so my responses may not
always be quick, but I wanted to let you know I'm genuinely
interested in following this work.

As it happens, I'm currently collaborating with Ishii-san — who, as
you know, is one of the original architects of multibyte/CJK support
in PostgreSQL — on Row Pattern Recognition; that might also be a
thread worth keeping an eye on.

It also strikes me that this is a topic worth considering in the
context of the rapid growth of SNS and AI-generated data. The
pervasive use of emoji — which cannot be represented in legacy
encodings like EUC-KR at all — is in fact accelerating the migration
toward Unicode in Korea and other Asian markets. This makes the
storage efficiency of Unicode for CJK characters an increasingly
practical concern, not just a theoretical one.

I'd like to take some time to analyze the current situation around
character encoding in Korea — where both EUC-KR legacy systems and
UTF-8 coexist in complex ways — review the patches you've attached,
and then share some thoughts and feedback.

Best regards,
Henson

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Experimenting with wider Unicode storage
@ 2026-04-21 01:16  Henson Choi <[email protected]>
  parent: Henson Choi <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Henson Choi @ 2026-04-21 01:16 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Tatsuo Ishii <[email protected]>

Hi Thomas,

Thank you again for sharing this exploration, and for including
Korean in your experiment table.  Rather than comment on the
patch itself, let me offer a ground-level report on where Korean
encoding reality sits in April 2026, because the picture has
shifted enough that I think it is worth entering into the record
before this thread accumulates momentum on motivations that may
no longer fully hold on this side of the region.

UTF-8 has already won in Korea, largely by inertia rather than
active choice.  Public web statistics put .kr sites at roughly
96% UTF-8 with a small EUC-KR residual of about 4% [1] —
noticeably higher than the ~1% Shift-JIS residual on .jp [2],
but steadily shrinking.  The mechanism is mundane: modern Linux
distributions default to UTF-8 locales, PostgreSQL's initdb
inherits that, and every new cluster is therefore UTF-8 from
birth.  The remaining legacy installations are not "haven't
migrated yet" — they are "have decided not to migrate," which is
a different and much slower population.

A clarification that often trips people up: in Korean practice,
"EUC-KR" is the label written down and CP949 is what actually
moves on the wire.  Microsoft's UHC has been the Windows default
for decades, and the MIME label has simply stuck.  The historical
stack goes KS X 1001 (완성형, 2,350 syllables) → EUC-KR → CP949
(11,172 syllables) → UTF-8.  PostgreSQL's strict EUC_KR decoder
rejects the bytes CP949 adds, which occasionally causes real
incidents when Windows-exchanged files are loaded.  For any
design choice about "Korean legacy support", this matters — what
needs supporting is usually CP949, not EUC-KR proper.

Server encoding and client encoding are also routinely split.  A
common Korean deployment pattern is a PostgreSQL cluster with
UTF-8 as server encoding, while legacy Windows / Delphi / C++ /
older Java clients connect with client_encoding set to EUC-KR or
CP949 and let PostgreSQL transcode at the wire boundary.  Many
systems that look like "EUC-KR systems" from the outside are
actually UTF-8 storage with an EUC-KR wire.  The storage-layer
share of legacy is therefore probably smaller still than the
3.8% web figure would suggest.

On the Korean row of your table landing at -16% under UTF-16:
that is structural, not noise.  Modern Korean writing mandates
word-space separation (unlike Chinese and Japanese), has
effectively abandoned hanja since the 1990s, and freely
interleaves ASCII acronyms (IT, AI, CEO).  As a result Korean
carries the highest ASCII share among CJK languages, and UTF-16
pays for each ASCII position (one byte → two) in exactly the
range where the Hangul savings are meant to come from.  Columns
without spaces — names, titles, addresses — could approach -33%,
but general prose cannot.  Those same short columns are, however,
exactly where the compression angle I return to further below
captures the equivalent saving without a new data type.

Storage pressure, to the extent modern operators feel it at all,
has largely migrated to other layers.  Memory and disk have both
followed exponential price/volume curves, and the CPU cost of
text comparison has disappeared inside other costs — network,
storage I/O, planning, JIT — to the point of invisibility in
profiler output.  For OLTP, the 2-vs-3-byte difference on Korean
columns does not feel meaningful on modern hardware.  For bulk
scans where byte counts still do matter, the industry answer has
already been columnar + zstd, which routinely reaches 90%+
compression on natural-language text and flattens the
CJK-vs-Latin ratio to irrelevance.  Embedded and edge are not
PostgreSQL's primary target, and archival sits in zstd territory
too.  The domains that historically motivated "we must narrow
CJK storage" have either moved outside the PostgreSQL shape or
been absorbed by general-purpose compression.

Meanwhile the cultural arrow points toward more Unicode, not
less.  KakaoTalk (which saturates domestic messaging), Naver
comments, Instagram captions, and YouTube normalise emoji in
everyday prose, while AI-generated Korean text contributes
middle dots, em dashes, and curly quotes at a scale that was
not present a few years ago.  The share of non-EUC-KR content
in everyday Korean prose is, informally, rising steadily.  Each
emoji is four UTF-8 bytes and is unrepresentable in any legacy
encoding at all.
A partial-coverage alternative looks increasingly awkward against
that trend.

Korean upstream feedback on encoding has also been notably quiet
despite a very active de-Oracle migration wave in the late 2010s.
I suspect this silence is not apathy but absence of a felt
problem — most of the community has simply moved on.

I should be careful here.  The "Korean side needs narrower CJK
storage" argument was genuinely strong around 2010, and I
remember when it motivated serious engineering time.  It is much
weaker in 2026: UTF-8 has won by default, legacy survivors are
confined to wire protocols and specific applications, OLTP does
not feel the byte cost, and bulk scan is already handled
elsewhere.  I raise this not to dismiss the technical work — the
patch shows real craft and the exploration is interesting on its
own terms.  But if the cover-letter motivation rests partly on
"this will help East Asian users, including Korea," I wanted you
to have a ground-level report: for Korean users specifically, the
pressure may no longer be strong enough to justify the complexity
described.  The calculus may well differ in Japanese or Chinese
markets — that is not for me to say.

One broader question, then, that I wanted to put to you: there
are three distinct axes on which utf16 could be pursued — as a
server character set, as a data type, or as a compression angle.
The character-set direction runs straight into the "continuation
byte must not look like ASCII" rule, as you already noted, and
is therefore effectively closed on PostgreSQL.  The data-type
direction is the current patch, which carries substantial
catalogue and operator surface, while the storage wins mostly
accrue on wider values — where columnar + zstd is already doing
the work.  What still seems genuinely unaddressed in practice is
the short-value regime: word-sized strings such as names,
titles, cities, and tags, which fall below the TOAST compression
threshold and therefore never see a compressor at all.  Would
framing this as "a compression method effective on word-sized
values" be a more productive angle than either of the other two?
The storage outcome could be similar with much less surface area
to maintain.

A fair counter on memory, before I go on: disk pressure has
clearly migrated elsewhere, but shared_buffers and work_mem
remain finite, and compression primarily addresses the disk
side.  A data-type approach that goes far enough to shrink the
in-memory representation — modifying every string function
along the way — tends to become a degraded form of a new
character set: doing most of the character-set work without the
character-set slot in PostgreSQL's encoding machinery, which as
above is closed.  None of the three axes therefore cleanly
solves the in-memory case; for truly memory-bound CJK workloads
the honest answer is probably just more RAM.

One concrete instantiation of that compression angle, if Korean
capacity specifically is the example that matters: take CP949
(which is what actually circulates under the EUC-KR label) as a
compression base and, for any character CP949 cannot represent,
spell it inline as a readable textual escape such as \u2603 or
U+2603 rather than a binary marker byte.  Native Korean text
then stays at two bytes per Hangul, emoji and modern Unicode
remain fully representable (at a modest cost per occurrence),
the in-memory representation stays plain UTF-8, and the on-disk
byte stream stays entirely within ASCII + CP949 — no new marker
byte, no collision with existing code paths that scan for raw
ASCII bytes.  If the source text itself contains sequences that
look like the escape syntax (for instance documentation quoting
\u-style literals), a simple doubling rule disambiguates them;
such cases are vanishingly rare in Korean business data.  This
targets exactly the short-value regime above, with far less
surface than a new data type.

For tighter byte density, one could go further by devising a
dedicated binary-level encoding, or by wiring zstd's external
dictionary feature into the column-compression path with a
pre-trained per-language dictionary — but either of those paths
carries its own implementation and operational costs.

Should you nonetheless decide to press on with utf16 as a data
type, I am willing to take the patch through a proper review; I
have already applied it on top of master and confirmed that the
regression tests pass, so the mechanical footing is in place.

[1] https://w3techs.com/technologies/segmentation/tld-kr-/character_encoding
[2] https://w3techs.com/technologies/segmentation/tld-jp-/character_encoding

Best regards,
Henson

>

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Experimenting with wider Unicode storage
@ 2026-04-30 00:40  Thomas Munro <[email protected]>
  parent: Henson Choi <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Thomas Munro @ 2026-04-30 00:40 UTC (permalink / raw)
  To: [email protected]; +Cc: PostgreSQL Hackers <[email protected]>; Tatsuo Ishii <[email protected]>

On Tue, Apr 21, 2026 at 1:16 PM Henson Choi <[email protected]> wrote:
> Thank you again for sharing this exploration, and for including
> Korean in your experiment table.  Rather than comment on the
> patch itself, let me offer a ground-level report on where Korean
> encoding reality sits in April 2026, because the picture has
> shifted enough that I think it is worth entering into the record
> before this thread accumulates momentum on motivations that may
> no longer fully hold on this side of the region.

Hi Henson,

Thank you for this thoughtful and broad feedback, which provided a lot
of useful context.  I appreciated all of it, and have responses to a
couple of the most actionable paragraphs:

> One broader question, then, that I wanted to put to you: there
> are three distinct axes on which utf16 could be pursued — as a
> server character set, as a data type, or as a compression angle.
> The character-set direction runs straight into the "continuation
> byte must not look like ASCII" rule, as you already noted, and
> is therefore effectively closed on PostgreSQL.  The data-type
> direction is the current patch, which carries substantial
> catalogue and operator surface, while the storage wins mostly
> accrue on wider values — where columnar + zstd is already doing
> the work.  What still seems genuinely unaddressed in practice is
> the short-value regime: word-sized strings such as names,
> titles, cities, and tags, which fall below the TOAST compression
> threshold and therefore never see a compressor at all.  Would
> framing this as "a compression method effective on word-sized
> values" be a more productive angle than either of the other two?
> The storage outcome could be similar with much less surface area
> to maintain.

Yeah, that is an interesting angle that I hadn't considered, at least
not with that framing.  There are even a couple of Unicode standards
that might apply here, and that I believe some other systems are
using:

https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode
https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode
https://www.unicode.org/notes/tn6/

BOCU-1 maintains binary codepoint order and reports typical
English/French as no size change compared to UTF-8,
Greek/Russian/Arabic/Hebrew as -40%, Hindi as -60% (this makes sense:
it's almost a generalised ISCII, so you get down to one byte per
character in any given Indian language), Japanese as -40% and
Chinese/Korean as -25% (Japanese presumably wins with kana sequences).

One of the ideas already mentioned in comments in the experimental
patch was that the iterator abstraction could allow for incremental
decompression, and I suppose there might be a way to expand BOCU-1 or
similar to UTF-8 incrementally in that layer.  I haven't looked into
that seriously though; so far I had only been thinking of that as a
way of generalising some open coded special cases that appear in a few
places to avoid detoasting.  ICU might also be able to consume it
incrementally, IDK.

zstd etc can clearly compress much more than that, as you say, but
then you have to deal with dictionary problems and it's hard to do
that for small values in a row-oriented system, as you say.  BOCU-1 is
dictionary-free, so you read it in direct byte order with only a tiny
state in a register or two, which seems to be potentially along the
lines you're suggesting.  Food for thought.

> A fair counter on memory, before I go on: disk pressure has
> clearly migrated elsewhere, but shared_buffers and work_mem
> remain finite, and compression primarily addresses the disk
> side.  A data-type approach that goes far enough to shrink the
> in-memory representation — modifying every string function
> along the way — tends to become a degraded form of a new
> character set: doing most of the character-set work without the
> character-set slot in PostgreSQL's encoding machinery, which as
> above is closed.  None of the three axes therefore cleanly
> solves the in-memory case; for truly memory-bound CJK workloads
> the honest answer is probably just more RAM.

Yeah.  It's an annoying set of constraints that led me to consider
this, while surveying text handling choices made in lots of database
systems.  Of course it wouldn't be my preference to introduce a new
type, but I couldn't see how how else to fit it in, and since I was
already investigating "modifying every string function along the way"
for other reasons, I wanted to explore what it would take to do that
generically enough to handle something as different as this while
remaining maintainable...

BTW here is the link that I forgot to add to the bottom of my earlier
email as reference [3], which is a blog from when SQL Server
introduced the *opposite* thing: UTF-8 support (like Windows itself,
in 2019).  Previously they had only legacy single/multi-byte encodings
in VARCHAR and UTF-16 in NVARCHAR, so there they were discussing this
tradeoff in reverse, ie space savings for some languages, but reported
25% increase in disk I/O for CJK databases moved to UTF-8.  (I don't
immediately know why SCSU didn't fix that.)

https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-support-for-sql-server/734928

> Should you nonetheless decide to press on with utf16 as a data
> type, I am willing to take the patch through a proper review; I
> have already applied it on top of master and confirmed that the
> regression tests pass, so the mechanical footing is in place.

Thanks.  I'm not planning to do more with the "separate UTF-16 type"
concept at this stage, based on your feedback so far.  I am still
working on a couple of text/encoding refactoring prototypes with other
goals, and will try to think about that "special Unicode compression"
angle while doing so.

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2026-04-30 00:40 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15 13:43 Experimenting with wider Unicode storage Thomas Munro <[email protected]>
2026-04-16 01:23 ` Henson Choi <[email protected]>
2026-04-21 01:16   ` Henson Choi <[email protected]>
2026-04-30 00:40     ` Thomas Munro <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox