public inbox for [email protected]  
help / color / mirror / Atom feed
Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
6+ messages / 2 participants
[nested] [flat]

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 05:38  Noah Misch <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Noah Misch @ 2026-02-14 05:38 UTC (permalink / raw)
  To: [email protected]; [email protected]; +Cc: [email protected]

On Fri, Feb 13, 2026 at 04:21:13PM -0800, Noah Misch wrote:
> Review welcome.  I have a Valgrind test run ongoing.

Valgrind found the complaint below, but I think this an instrumentation
problem.  I've added a fix for that instrumentation.  I also made minor edits
to the log message of the main patch, hence v3.

The release team is preparing to announce a 2026-02-26 out-of-cycle release in
light of this regression.  I plan to push these fixes at 2026-02-14T20:00+0000
to unblock that formal announcement.


==00:00:01:11.756 3464664== VALGRINDERROR-BEGIN
==00:00:01:11.756 3464664== Unaddressable byte(s) found during client check request
==00:00:01:11.769 3464664==    at 0xC6076A: pg_mblen_with_len (mbutils.c:1115)
==00:00:01:11.769 3464664==    by 0xC07CD3: pg_mbcharcliplen_chars (varlena.c:807)
==00:00:01:11.770 3464664==    by 0xC07AAD: text_substring (varlena.c:732)
==00:00:01:11.770 3464664==    by 0xC07797: text_substr (varlena.c:553)
==00:00:01:11.770 3464664==    by 0x779688: ExecInterpExpr (execExprInterp.c:953)
==00:00:01:11.770 3464664==    by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299)
==00:00:01:11.770 3464664==    by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423)
==00:00:01:11.770 3464664==    by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464)
==00:00:01:11.770 3464664==    by 0x7DBCD3: ExecProject (executor.h:496)
==00:00:01:11.770 3464664==    by 0x7DC134: ExecScanExtended (execScan.h:234)
==00:00:01:11.770 3464664==    by 0x7DC474: ExecSeqScanWithProject (nodeSeqscan.c:162)
==00:00:01:11.770 3464664==    by 0x794561: ExecProcNodeFirst (execProcnode.c:469)
==00:00:01:11.770 3464664==  Address 0x19340a5e is 12,062 bytes inside a block of size 12,064 alloc'd
==00:00:01:11.770 3464664==    at 0x4C29F73: malloc (vg_replace_malloc.c:309)
==00:00:01:11.770 3464664==    by 0xC79F03: AllocSetAllocLarge (aset.c:756)
==00:00:01:11.770 3464664==    by 0xC7AA4C: AllocSetAlloc (aset.c:1033)
==00:00:01:11.770 3464664==    by 0xC8B37E: palloc (mcxt.c:1408)
==00:00:01:11.770 3464664==    by 0x4B0182: detoast_attr_slice (detoast.c:324)
==00:00:01:11.770 3464664==    by 0xC5177F: pg_detoast_datum_slice (fmgr.c:1825)
==00:00:01:11.770 3464664==    by 0xC07A11: text_substring (varlena.c:716)
==00:00:01:11.770 3464664==    by 0xC07797: text_substr (varlena.c:553)
==00:00:01:11.770 3464664==    by 0x779688: ExecInterpExpr (execExprInterp.c:953)
==00:00:01:11.770 3464664==    by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299)
==00:00:01:11.770 3464664==    by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423)
==00:00:01:11.770 3464664==    by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464)
==00:00:01:11.770 3464664== 
==00:00:01:11.770 3464664== VALGRINDERROR-END
{
   <insert_a_suppression_name_here>
   Memcheck:User
   fun:pg_mblen_with_len
   fun:pg_mbcharcliplen_chars
   fun:text_substring
   fun:text_substr
   fun:ExecInterpExpr
   fun:ExecInterpExprStillValid
   fun:ExecEvalExprNoReturn
   fun:ExecEvalExprNoReturnSwitchContext
   fun:ExecProject
   fun:ExecScanExtended
   fun:ExecSeqScanWithProject
   fun:ExecProcNodeFirst
}
2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding STATEMENT:  SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
**00:00:01:11.771 3464664** Valgrind detected 1 error(s) during execution of "SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;"

From: Noah Misch <[email protected]>

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, 8 bytes spans 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  For avoidance of doubt, stop detoasting extra.

Back-patch to v14, like the above commit.  For applications using
SUBSTRING() on non-ASCII column values, consider applying this to your
copy of any of the February 12, 2026 releases.

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd71..99a5470 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -133,6 +133,7 @@ static text *text_substring(Datum str,
 							int32 start,
 							int32 length,
 							bool length_not_specified);
+static int	pg_mbcharcliplen_chars(const char *mbstr, int len, int limit);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
 static int	text_position(text *t1, text *t2, Oid collid);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
@@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
+			 * Ending at position 1, exclusive, obviously yields an empty
+			 * string.  A zero or negative value can happen if the start was
+			 * negative or one. SQL99 says to return a zero-length string.
 			 */
-			if (E < 1)
+			if (E <= 1)
 				return cstring_to_text("");
 
 			/*
@@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			L1 = E - S1;
 
 			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
+			 * Total slice size in bytes can't be any longer than the
+			 * inclusive end position times the encoding max length.  If that
+			 * overflows, we can just use -1.
 			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		}
 
 		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
+		slice_strlen =
+			(slice_size == -1 ?
+			 pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+			 pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1));
 
 		/*
 		 * Check that the start position wasn't > slice_strlen. If so, SQL99
@@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 }
 
 /*
+ * pg_mbcharcliplen_chars -
+ *	Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes.
+ *
+ *	This mirrors all the dubious historical behavior, so it's static to
+ *	discourage proliferation.  The assertions are specific to the one caller.
+ */
+static int
+pg_mbcharcliplen_chars(const char *mbstr, int len, int limit)
+{
+	int			nch = 0;
+	int			l;
+
+	Assert(len > 0);
+	Assert(limit > 0);
+	Assert(pg_database_encoding_max_length() > 1);
+
+	while (len > 0 && *mbstr)
+	{
+		l = pg_mblen_with_len(mbstr, len);
+		nch++;
+		if (nch == limit)
+			break;
+		len -= l;
+		mbstr += l;
+	}
+	return nch;
+}
+
+/*
  * textoverlay
  *	Replace specified substring of first string with second
  *
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38c..cac1cb7 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -375,7 +381,43 @@ NOTICE:  MULE_INTERNAL LC2:           \x908283 -> {9470595} -> \x908283 = OK
  t
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ 
+(1 row)
+
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+ substring 
+-----------
+ 🚀
+(1 row)
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0..782f90f 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES
 SELECT COUNT(test_encoding(encoding, description, input)) > 0
 FROM encoding_tests;
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;

From: Noah Misch <[email protected]>

pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport.

The prior order caused spurious Valgrind errors.  They're spurious
because the ereport(ERROR) stops subsequent code from accessing the
memory in question.  pg_mblen_cstr() ordered the checks correctly, but
these other two did not.  Back-patch to v14, like commit
1e7fe06c10c0a8da9dd6261a6be8d405dc17c728.

Reviewed-by: FIXME
Discussion: https://postgr.es/m/FIXME
Backpatch-through: 14

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index a5a7348..f3f94d4 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -1086,15 +1086,16 @@ pg_mblen_range(const char *mbstr, const char *end)
 	int			length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
 
 	Assert(end > mbstr);
+
+	if (unlikely(mbstr + length > end))
+		report_invalid_encoding_db(mbstr, length, end - mbstr);
+
 #ifdef VALGRIND_EXPENSIVE
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, end - mbstr);
 #else
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length);
 #endif
 
-	if (unlikely(mbstr + length > end))
-		report_invalid_encoding_db(mbstr, length, end - mbstr);
-
 	return length;
 }
 
@@ -1109,15 +1110,16 @@ pg_mblen_with_len(const char *mbstr, int limit)
 	int			length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
 
 	Assert(limit >= 1);
+
+	if (unlikely(length > limit))
+		report_invalid_encoding_db(mbstr, length, limit);
+
 #ifdef VALGRIND_EXPENSIVE
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, limit);
 #else
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length);
 #endif
 
-	if (unlikely(length > limit))
-		report_invalid_encoding_db(mbstr, length, limit);
-
 	return length;
 }
 


Attachments:

  [text/plain] toast-slice-mblen-v3.patch (8.7K, 2-toast-slice-mblen-v3.patch)
  download | inline diff:
From: Noah Misch <[email protected]>

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, 8 bytes spans 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  For avoidance of doubt, stop detoasting extra.

Back-patch to v14, like the above commit.  For applications using
SUBSTRING() on non-ASCII column values, consider applying this to your
copy of any of the February 12, 2026 releases.

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd71..99a5470 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -133,6 +133,7 @@ static text *text_substring(Datum str,
 							int32 start,
 							int32 length,
 							bool length_not_specified);
+static int	pg_mbcharcliplen_chars(const char *mbstr, int len, int limit);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
 static int	text_position(text *t1, text *t2, Oid collid);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
@@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
+			 * Ending at position 1, exclusive, obviously yields an empty
+			 * string.  A zero or negative value can happen if the start was
+			 * negative or one. SQL99 says to return a zero-length string.
 			 */
-			if (E < 1)
+			if (E <= 1)
 				return cstring_to_text("");
 
 			/*
@@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			L1 = E - S1;
 
 			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
+			 * Total slice size in bytes can't be any longer than the
+			 * inclusive end position times the encoding max length.  If that
+			 * overflows, we can just use -1.
 			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		}
 
 		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
+		slice_strlen =
+			(slice_size == -1 ?
+			 pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+			 pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1));
 
 		/*
 		 * Check that the start position wasn't > slice_strlen. If so, SQL99
@@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 }
 
 /*
+ * pg_mbcharcliplen_chars -
+ *	Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes.
+ *
+ *	This mirrors all the dubious historical behavior, so it's static to
+ *	discourage proliferation.  The assertions are specific to the one caller.
+ */
+static int
+pg_mbcharcliplen_chars(const char *mbstr, int len, int limit)
+{
+	int			nch = 0;
+	int			l;
+
+	Assert(len > 0);
+	Assert(limit > 0);
+	Assert(pg_database_encoding_max_length() > 1);
+
+	while (len > 0 && *mbstr)
+	{
+		l = pg_mblen_with_len(mbstr, len);
+		nch++;
+		if (nch == limit)
+			break;
+		len -= l;
+		mbstr += l;
+	}
+	return nch;
+}
+
+/*
  * textoverlay
  *	Replace specified substring of first string with second
  *
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38c..cac1cb7 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -375,7 +381,43 @@ NOTICE:  MULE_INTERNAL LC2:           \x908283 -> {9470595} -> \x908283 = OK
  t
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ 
+(1 row)
+
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+ substring 
+-----------
+ 🚀
+(1 row)
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0..782f90f 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES
 SELECT COUNT(test_encoding(encoding, description, input)) > 0
 FROM encoding_tests;
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;


  [text/plain] mblen-valgrind-after-report-v1.patch (1.7K, 3-mblen-valgrind-after-report-v1.patch)
  download | inline diff:
From: Noah Misch <[email protected]>

pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport.

The prior order caused spurious Valgrind errors.  They're spurious
because the ereport(ERROR) stops subsequent code from accessing the
memory in question.  pg_mblen_cstr() ordered the checks correctly, but
these other two did not.  Back-patch to v14, like commit
1e7fe06c10c0a8da9dd6261a6be8d405dc17c728.

Reviewed-by: FIXME
Discussion: https://postgr.es/m/FIXME
Backpatch-through: 14

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index a5a7348..f3f94d4 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -1086,15 +1086,16 @@ pg_mblen_range(const char *mbstr, const char *end)
 	int			length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
 
 	Assert(end > mbstr);
+
+	if (unlikely(mbstr + length > end))
+		report_invalid_encoding_db(mbstr, length, end - mbstr);
+
 #ifdef VALGRIND_EXPENSIVE
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, end - mbstr);
 #else
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length);
 #endif
 
-	if (unlikely(mbstr + length > end))
-		report_invalid_encoding_db(mbstr, length, end - mbstr);
-
 	return length;
 }
 
@@ -1109,15 +1110,16 @@ pg_mblen_with_len(const char *mbstr, int limit)
 	int			length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
 
 	Assert(limit >= 1);
+
+	if (unlikely(length > limit))
+		report_invalid_encoding_db(mbstr, length, limit);
+
 #ifdef VALGRIND_EXPENSIVE
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, limit);
 #else
 	VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length);
 #endif
 
-	if (unlikely(length > limit))
-		report_invalid_encoding_db(mbstr, length, limit);
-
 	return length;
 }
 


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 08:15  Thomas Munro <[email protected]>
  parent: Noah Misch <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Thomas Munro @ 2026-02-14 08:15 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected]

On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote:
> [mblen-valgrind-after-report-v1.patch]

LGTM.  The new valgrind check should clearly be after the new non-local exit.

Studying the other patch...






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 19:07  Thomas Munro <[email protected]>
  parent: Thomas Munro <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Thomas Munro @ 2026-02-14 19:07 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected]

On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <[email protected]> wrote:
> On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote:
> > [mblen-valgrind-after-report-v1.patch]
>
> LGTM.  The new valgrind check should clearly be after the new non-local exit.
>
> Studying the other patch...

                        /*
-                        * Total slice size in bytes can't be any
longer than the start
-                        * position plus substring length times the
encoding max length.
-                        * If that overflows, we can just use -1.
+                        * Total slice size in bytes can't be any
longer than the
+                        * inclusive end position times the encoding
max length.  If that
+                        * overflows, we can just use -1.
                         */
-                       if (pg_mul_s32_overflow(E, eml, &slice_size))
+                       if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
                                slice_size = -1;

Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?

                /* Now we can get the actual length of the slice in MB
characters */
-               slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-
                 slice_len);
+               slice_strlen =
+                       (slice_size == -1 ?
+                        pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+                        pg_mbcharcliplen_chars(VARDATA_ANY(slice),
slice_len, E - 1));

Comment presumably needs adjustment to say that we only count as far
as we need to, and why.

There is something a bit strange about all this, though.
pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
that really exist past 2^29 (~500 million), you must get an empty
string, right?   That's hard to reach, pre-existing and out of scope
for the immediate problem report, except ... now we're contorting the
code even further to keep it.

The outline I had come up with before seeing your patch was: let's
just delete it.  The position search can check bounds incrementally,
following our general approach.  This avoids the reported problem by
ditching the pre-flight scan through the slice (up to 4x more
pg_mblen_XXX calls and memory access than we strictly need), and also
the special cases for empty strings since they already fall out of the
general behaviour (am I missing something?), not leaving much code
behind.  As far as I can see so far, the only user-visible side-effect
requires corruption: substring() moves from the
internal-NUL-is-terminator category to internal-NUL-is-character
category, but that's an implementation detail.

When I saw your patch yesterday, I initially abandoned the thought,
thinking that your idea looked more conservative, but after sleeping
on it and reflecting again on these oddities, I have merged my draft
implementation with your tests, ancient detoasting fence post
observation and commit message, just to see if you think this approach
might be worth considering further.


Attachments:

  [application/x-patch] v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch (13.0K, 2-v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch)
  download | inline diff:
From 9f03e4d23356e8012e8a190bddd81059dc88471b Mon Sep 17 00:00:00 2001
From: Noah Misch <[email protected]>
Date: Sat, 14 Feb 2026 21:07:38 +1300
Subject: [PATCH v2tm 1/2] Fix SUBSTRING() for toasted multibyte characters.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() did
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, 8 bytes spans 2 complete chars and 1 partial char.

Fix by removing the pg_mbstrlen_with_len() call.  An encoding error is
still raised if input runs out in the middle of a multi-byte sequence
that is needed for the result, but that can be detected incrementally.

This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  For avoidance of doubt, stop detoasting extra.

A long standing overflow bug affecting positions exceeding 2^29 is also
addressed, though such large numbers would be unlikely to be used in
real applications.

Back-patch to v14, like the above commit.  For applications using
SUBSTRING() on non-ASCII column values, consider applying this to your
copy of any of the February 12, 2026 releases.

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14
---
 src/backend/utils/adt/varlena.c        | 103 +++++++------------------
 src/test/regress/expected/encoding.out |  72 ++++++++++++-----
 src/test/regress/sql/encoding.sql      |  26 +++++--
 3 files changed, 102 insertions(+), 99 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd7160d6..5e3951ab0e1 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -586,7 +586,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -644,34 +644,27 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		 * detoasting, so we'll grab a conservatively large slice now and go
 		 * back later to do the right thing
 		 */
-		int32		slice_start;
 		int32		slice_size;
-		int32		slice_strlen;
-		int32		slice_len;
+		const char *slice_end;
 		text	   *slice;
-		int32		E1;
 		int32		i;
 		char	   *p;
 		char	   *s;
 		text	   *ret;
 
-		/*
-		 * We need to start at position zero because there is no way to know
-		 * in advance which byte offset corresponds to the supplied start
-		 * position.
-		 */
-		slice_start = 0;
-
-		if (length_not_specified)	/* special case - get length to end of
-									 * string */
-			slice_size = L1 = -1;
+		if (length_not_specified)	/* special case - get whole string */
+		{
+			slice_size = -1;
+			E = PG_INT32_MAX;
+		}
 		else if (length < 0)
 		{
 			/* SQL99 says to throw an error for E < S, i.e., negative length */
 			ereport(ERROR,
 					(errcode(ERRCODE_SUBSTRING_ERROR),
 					 errmsg("negative substring length not allowed")));
-			slice_size = L1 = -1;	/* silence stupider compilers */
+			slice_size = -1;	/* silence stupider compilers */
+			E = PG_INT32_MAX;
 		}
 		else if (pg_add_s32_overflow(S, length, &E))
 		{
@@ -679,30 +672,17 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			 * L could be large enough for S + L to overflow, in which case
 			 * the substring must run to end of string.
 			 */
-			slice_size = L1 = -1;
+			slice_size = -1;
+			E = PG_INT32_MAX;
 		}
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
-			 */
-			if (E < 1)
-				return cstring_to_text("");
-
-			/*
-			 * if E is past the end of the string, the tuple toaster will
-			 * truncate the length for us
+			 * Total slice size in bytes can't be any longer than the end
+			 * position (made zero-based) times the encoding max length. If
+			 * that overflows, we can just use -1.
 			 */
-			L1 = E - S1;
-
-			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
-			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -712,63 +692,34 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		 */
 		if (VARATT_IS_COMPRESSED(DatumGetPointer(str)) ||
 			VARATT_IS_EXTERNAL(DatumGetPointer(str)))
-			slice = DatumGetTextPSlice(str, slice_start, slice_size);
+			slice = DatumGetTextPSlice(str, 0, slice_size);
 		else
 			slice = (text *) DatumGetPointer(str);
 
-		/* see if we got back an empty string */
-		slice_len = VARSIZE_ANY_EXHDR(slice);
-		if (slice_len == 0)
-		{
-			if (slice != (text *) DatumGetPointer(str))
-				pfree(slice);
-			return cstring_to_text("");
-		}
-
-		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
-
-		/*
-		 * Check that the start position wasn't > slice_strlen. If so, SQL99
-		 * says to return a zero-length string.
-		 */
-		if (S1 > slice_strlen)
-		{
-			if (slice != (text *) DatumGetPointer(str))
-				pfree(slice);
-			return cstring_to_text("");
-		}
-
-		/*
-		 * Adjust L1 and E1 now that we know the slice string length. Again
-		 * remember that S1 is one based, and slice_start is zero based.
-		 */
-		if (L1 > -1)
-			E1 = Min(S1 + L1, slice_start + 1 + slice_strlen);
-		else
-			E1 = slice_start + 1 + slice_strlen;
+		p = VARDATA_ANY(slice);
+		slice_end = p + VARSIZE_ANY_EXHDR(slice);
 
 		/*
-		 * Find the start position in the slice; remember S1 is not zero based
+		 * Find the start position in the slice; remember S1 is not zero
+		 * based. If we run out of input first, we'll return an empty string.
 		 */
-		p = VARDATA_ANY(slice);
-		for (i = 0; i < S1 - 1; i++)
-			p += pg_mblen_unbounded(p);
+		for (i = 1; i < S1 && p < slice_end; i++)
+			p += pg_mblen_range(p, slice_end);
 
 		/* hang onto a pointer to our start position */
 		s = p;
 
 		/*
 		 * Count the actual bytes used by the substring of the requested
-		 * length.
+		 * length, again giving up if we run out of input.
 		 */
-		for (i = S1; i < E1; i++)
-			p += pg_mblen_unbounded(p);
+		for (i = S1; i < E && p < slice_end; i++)
+			p += pg_mblen_range(p, slice_end);
 
 		ret = (text *) palloc(VARHDRSZ + (p - s));
 		SET_VARSIZE(ret, VARHDRSZ + (p - s));
-		memcpy(VARDATA(ret), s, (p - s));
+		if (s < p)
+			memcpy(VARDATA(ret), s, (p - s));
 
 		if (slice != (text *) DatumGetPointer(str))
 			pfree(slice);
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38cff41..eec879b05ec 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -84,6 +90,13 @@ SELECT length(with_nul) FROM regress_encoding;
       4
 (1 row)
 
+SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding;
+ regexp_replace 
+----------------
+ é
+(1 row)
+
+-- NUL = character
 SELECT substring(with_nul, 3, 1) FROM regress_encoding;
  substring 
 -----------
@@ -96,25 +109,12 @@ SELECT substring(with_nul, 4, 1) FROM regress_encoding;
  é
 (1 row)
 
-SELECT substring(with_nul, 5, 1) FROM regress_encoding;
- substring 
------------
- 
-(1 row)
-
-SELECT convert_to(substring(with_nul, 5, 1), 'UTF8') FROM regress_encoding;
- convert_to 
-------------
- \x
-(1 row)
-
-SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding;
- regexp_replace 
-----------------
- é
+SELECT substring(with_nul, 5, 1) = test_bytea_to_text('\x00') FROM regress_encoding;
+ ?column? 
+----------
+ t
 (1 row)
 
--- NUL = character
 SELECT with_nul, reverse(with_nul), reverse(reverse(with_nul)) FROM regress_encoding;
  with_nul | reverse | reverse 
 ----------+---------+---------
@@ -375,7 +375,43 @@ NOTICE:  MULE_INTERNAL LC2:           \x908283 -> {9470595} -> \x908283 = OK
  t
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ 
+(1 row)
+
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+ substring 
+-----------
+ 🚀
+(1 row)
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0cb32..7de95f49947 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -51,12 +52,11 @@ SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
 
 -- NUL = terminator
 SELECT length(with_nul) FROM regress_encoding;
-SELECT substring(with_nul, 3, 1) FROM regress_encoding;
-SELECT substring(with_nul, 4, 1) FROM regress_encoding;
-SELECT substring(with_nul, 5, 1) FROM regress_encoding;
-SELECT convert_to(substring(with_nul, 5, 1), 'UTF8') FROM regress_encoding;
 SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding;
 -- NUL = character
+SELECT substring(with_nul, 3, 1) FROM regress_encoding;
+SELECT substring(with_nul, 4, 1) FROM regress_encoding;
+SELECT substring(with_nul, 5, 1) = test_bytea_to_text('\x00') FROM regress_encoding;
 SELECT with_nul, reverse(with_nul), reverse(reverse(with_nul)) FROM regress_encoding;
 
 -- If a corrupted string contains NUL in the tail bytes of a multibyte
@@ -212,7 +212,23 @@ INSERT INTO encoding_tests VALUES
 SELECT COUNT(test_encoding(encoding, description, input)) > 0
 FROM encoding_tests;
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 19:33  Noah Misch <[email protected]>
  parent: Thomas Munro <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Noah Misch @ 2026-02-14 19:33 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: [email protected]; [email protected]

On Sun, Feb 15, 2026 at 08:07:22AM +1300, Thomas Munro wrote:
> On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <[email protected]> wrote:
> > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote:
> > > [mblen-valgrind-after-report-v1.patch]
> >
> > LGTM.  The new valgrind check should clearly be after the new non-local exit.
> >
> > Studying the other patch...

Thank you.

>                         /*
> -                        * Total slice size in bytes can't be any
> longer than the start
> -                        * position plus substring length times the
> encoding max length.
> -                        * If that overflows, we can just use -1.
> +                        * Total slice size in bytes can't be any
> longer than the
> +                        * inclusive end position times the encoding
> max length.  If that
> +                        * overflows, we can just use -1.
>                          */
> -                       if (pg_mul_s32_overflow(E, eml, &slice_size))
> +                       if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
>                                 slice_size = -1;
> 
> Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?

Since it's discrete (being an integer), "up to E, exclusive" and "up to E - 1,
inclusive" are the same thing.  My comment may not be the optimal way to
express that.

>                 /* Now we can get the actual length of the slice in MB
> characters */
> -               slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
> -
>                  slice_len);
> +               slice_strlen =
> +                       (slice_size == -1 ?
> +                        pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
> +                        pg_mbcharcliplen_chars(VARDATA_ANY(slice),
> slice_len, E - 1));
> 
> Comment presumably needs adjustment to say that we only count as far
> as we need to, and why.

Changed to:

		/*
		 * Now we can get the actual length of the slice in MB characters,
		 * stopping at the end of the substring.  Continuing beyond the
		 * substring end could find an incomplete character attributable
		 * solely to DatumGetTextPSlice() chopping in the middle of a
		 * character, and it would be superfluous work at best.
		 */

> There is something a bit strange about all this, though.
> pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
> that really exist past 2^29 (~500 million), you must get an empty
> string, right?   That's hard to reach, pre-existing and out of scope
> for the immediate problem report, except ... now we're contorting the
> code even further to keep it.

- slice_size is the amount we *requested* from the toaster.  It can be -1,
  which retrieves the max available.
- slice_len is the amount *returned* from the toaster.  It's nonnegative.

Does a behavior specific to strings >2^29 still exist?

> The outline I had come up with before seeing your patch was: let's
> just delete it.  The position search can check bounds incrementally,
> following our general approach.  This avoids the reported problem by
> ditching the pre-flight scan through the slice (up to 4x more
> pg_mblen_XXX calls and memory access than we strictly need), and also
> the special cases for empty strings since they already fall out of the
> general behaviour (am I missing something?), not leaving much code
> behind.

Like you, I made a note that it's wasteful to make two mblen passes over the
string.  I'm only seeing a 50% reduction in mblen calls, not an 80% reduction,
but I didn't look too closely.  I guessed such a change would be less clearly
correct, so I figured it would be less suitable for back branches.  Hence, I
didn't draft it.

> As far as I can see so far, the only user-visible side-effect
> requires corruption: substring() moves from the
> internal-NUL-is-terminator category to internal-NUL-is-character
> category, but that's an implementation detail.

That does carry some risk, not necessarily too much to accept.

> When I saw your patch yesterday, I initially abandoned the thought,
> thinking that your idea looked more conservative, but after sleeping
> on it and reflecting again on these oddities, I have merged my draft
> implementation with your tests, ancient detoasting fence post
> observation and commit message, just to see if you think this approach
> might be worth considering further.

My first impression, hurried due to the commit ETA in 30 minutes, is that this
is less conservative and should hold for master-only.






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 20:10  Thomas Munro <[email protected]>
  parent: Noah Misch <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Thomas Munro @ 2026-02-14 20:10 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected]

On Sun, Feb 15, 2026 at 8:33 AM Noah Misch <[email protected]> wrote:
> - slice_len is the amount *returned* from the toaster.  It's nonnegative.

Ah, right, that makes more sense.

> > The outline I had come up with before seeing your patch was: let's
> > just delete it.  The position search can check bounds incrementally,
> > following our general approach.  This avoids the reported problem by
> > ditching the pre-flight scan through the slice (up to 4x more
> > pg_mblen_XXX calls and memory access than we strictly need), and also
> > the special cases for empty strings since they already fall out of the
> > general behaviour (am I missing something?), not leaving much code
> > behind.
>
> Like you, I made a note that it's wasteful to make two mblen passes over the
> string.  I'm only seeing a 50% reduction in mblen calls, not an 80% reduction,
> but I didn't look too closely.  I guessed such a change would be less clearly
> correct, so I figured it would be less suitable for back branches.  Hence, I
> didn't draft it.

I was comparing to unpatched master, but yeah of course your patch
already gets part of the way there.

> My first impression, hurried due to the commit ETA in 30 minutes, is that this
> is less conservative and should hold for master-only.

Got it.  Will add it to the pile of master-only fallout from this area.






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
@ 2026-02-14 20:19  Thomas Munro <[email protected]>
  parent: Thomas Munro <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Thomas Munro @ 2026-02-14 20:19 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected]

Also, LGTM.






^ permalink  raw  reply  [nested|flat] 6+ messages in thread


end of thread, other threads:[~2026-02-14 20:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]>
2026-02-14 08:15 ` Thomas Munro <[email protected]>
2026-02-14 19:07   ` Thomas Munro <[email protected]>
2026-02-14 19:33     ` Noah Misch <[email protected]>
2026-02-14 20:10       ` Thomas Munro <[email protected]>
2026-02-14 20:19         ` Thomas Munro <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox