Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Noah Misch <[email protected]>
To: [email protected]
To: [email protected]
Cc: [email protected]
Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Date: Fri, 13 Feb 2026 16:21:13 -0800
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>

On Fri, Feb 13, 2026 at 02:48:04PM -0800, Noah Misch wrote:
> On Fri, Feb 13, 2026 at 09:27:02AM -0800, Noah Misch wrote:
> > On Fri, Feb 13, 2026 at 07:46:22AM +0000, PG Bug reporting form wrote:
> > > After upgrading from PostgreSQL 15.15 to 15.16, substring(text) raises:
> > > >ERROR: invalid byte sequence for encoding "UTF8": 0xe6 0x97
> > > on valid UTF-8 text stored in a TOAST-compressed column.
> > 
> > > user=> select substring(data from 1 for 1) from toast_repro;
> > > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xe6 0x97
> > 
> > Thanks for the report.  That is a bug and a regression; I regret missing it
> > during review.  The substring operation works by taking a 4-byte slice from
> > the toasted value (4 bytes being the max length of a UTF8 char in PostgreSQL),
> > the finding the actual first character within those bytes.  However, it
> > incorrectly requires those four bytes to be a valid UTF8 string.  I'll start
> > on a fix.
> 
> Attached.  I may add some more tests, e.g. a toasted invalid string where the
> detoasted length is less than the slice we request.

Tests already covered that in particular, but I added some other tests.  I
think this is now ready.  Review welcome.  I have a Valgrind test run ongoing.

From: Noah Misch <[email protected]>

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, that yields 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  Fix that incidentally.  That and stopping the char count
early might improve performance.

Back-patch to v14 (all supported versions).

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd71..99a5470 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -133,6 +133,7 @@ static text *text_substring(Datum str,
 							int32 start,
 							int32 length,
 							bool length_not_specified);
+static int	pg_mbcharcliplen_chars(const char *mbstr, int len, int limit);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
 static int	text_position(text *t1, text *t2, Oid collid);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
@@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
+			 * Ending at position 1, exclusive, obviously yields an empty
+			 * string.  A zero or negative value can happen if the start was
+			 * negative or one. SQL99 says to return a zero-length string.
 			 */
-			if (E < 1)
+			if (E <= 1)
 				return cstring_to_text("");
 
 			/*
@@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			L1 = E - S1;
 
 			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
+			 * Total slice size in bytes can't be any longer than the
+			 * inclusive end position times the encoding max length.  If that
+			 * overflows, we can just use -1.
 			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		}
 
 		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
+		slice_strlen =
+			(slice_size == -1 ?
+			 pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+			 pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1));
 
 		/*
 		 * Check that the start position wasn't > slice_strlen. If so, SQL99
@@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 }
 
 /*
+ * pg_mbcharcliplen_chars -
+ *	Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes.
+ *
+ *	This mirrors all the dubious historical behavior, so it's static to
+ *	discourage proliferation.  The assertions are specific to the one caller.
+ */
+static int
+pg_mbcharcliplen_chars(const char *mbstr, int len, int limit)
+{
+	int			nch = 0;
+	int			l;
+
+	Assert(len > 0);
+	Assert(limit > 0);
+	Assert(pg_database_encoding_max_length() > 1);
+
+	while (len > 0 && *mbstr)
+	{
+		l = pg_mblen_with_len(mbstr, len);
+		nch++;
+		if (nch == limit)
+			break;
+		len -= l;
+		mbstr += l;
+	}
+	return nch;
+}
+
+/*
  * textoverlay
  *	Replace specified substring of first string with second
  *
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38c..cac1cb7 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -375,7 +381,43 @@ NOTICE:  MULE_INTERNAL LC2:           \x908283 -> {9470595} -> \x908283 = OK
  t
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ 
+(1 row)
+
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+ substring 
+-----------
+ 🚀
+(1 row)
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0..782f90f 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES
 SELECT COUNT(test_encoding(encoding, description, input)) > 0
 FROM encoding_tests;
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;


Attachments:

  [text/plain] toast-slice-mblen-v2.patch (8.6K, 2-toast-slice-mblen-v2.patch)
  download | inline diff:
From: Noah Misch <[email protected]>

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, that yields 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  Fix that incidentally.  That and stopping the char count
early might improve performance.

Back-patch to v14 (all supported versions).

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd71..99a5470 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -133,6 +133,7 @@ static text *text_substring(Datum str,
 							int32 start,
 							int32 length,
 							bool length_not_specified);
+static int	pg_mbcharcliplen_chars(const char *mbstr, int len, int limit);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
 static int	text_position(text *t1, text *t2, Oid collid);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
@@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
+			 * Ending at position 1, exclusive, obviously yields an empty
+			 * string.  A zero or negative value can happen if the start was
+			 * negative or one. SQL99 says to return a zero-length string.
 			 */
-			if (E < 1)
+			if (E <= 1)
 				return cstring_to_text("");
 
 			/*
@@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			L1 = E - S1;
 
 			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
+			 * Total slice size in bytes can't be any longer than the
+			 * inclusive end position times the encoding max length.  If that
+			 * overflows, we can just use -1.
 			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		}
 
 		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
+		slice_strlen =
+			(slice_size == -1 ?
+			 pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+			 pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1));
 
 		/*
 		 * Check that the start position wasn't > slice_strlen. If so, SQL99
@@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 }
 
 /*
+ * pg_mbcharcliplen_chars -
+ *	Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes.
+ *
+ *	This mirrors all the dubious historical behavior, so it's static to
+ *	discourage proliferation.  The assertions are specific to the one caller.
+ */
+static int
+pg_mbcharcliplen_chars(const char *mbstr, int len, int limit)
+{
+	int			nch = 0;
+	int			l;
+
+	Assert(len > 0);
+	Assert(limit > 0);
+	Assert(pg_database_encoding_max_length() > 1);
+
+	while (len > 0 && *mbstr)
+	{
+		l = pg_mblen_with_len(mbstr, len);
+		nch++;
+		if (nch == limit)
+			break;
+		len -= l;
+		mbstr += l;
+	}
+	return nch;
+}
+
+/*
  * textoverlay
  *	Replace specified substring of first string with second
  *
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38c..cac1cb7 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -375,7 +381,43 @@ NOTICE:  MULE_INTERNAL LC2:           \x908283 -> {9470595} -> \x908283 = OK
  t
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ 
+(1 row)
+
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0x80
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+ substring 
+-----------
+ 🚀
+(1 row)
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0..782f90f 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES
 SELECT COUNT(test_encoding(encoding, description, input)) > 0
 FROM encoding_tests;
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- diagnose incomplete char iff within the substring
+UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280');
+SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8;
+SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;
+-- substring needing last byte of its slice_size
+ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8;
+UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000);
+SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8;
+
 DROP TABLE encoding_tests;
+DROP TABLE toast_4b_utf8;
 DROP FUNCTION test_encoding;
 DROP FUNCTION test_text_to_wchars;
 DROP FUNCTION test_mblen_func;

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox