public inbox for [email protected]help / color / mirror / Atom feed
Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 6+ messages / 2 participants [nested] [flat]
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 @ 2026-02-14 05:38 Noah Misch <[email protected]> 2026-02-14 08:15 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Noah Misch @ 2026-02-14 05:38 UTC (permalink / raw) To: [email protected]; [email protected]; +Cc: [email protected] On Fri, Feb 13, 2026 at 04:21:13PM -0800, Noah Misch wrote: > Review welcome. I have a Valgrind test run ongoing. Valgrind found the complaint below, but I think this an instrumentation problem. I've added a fix for that instrumentation. I also made minor edits to the log message of the main patch, hence v3. The release team is preparing to announce a 2026-02-26 out-of-cycle release in light of this regression. I plan to push these fixes at 2026-02-14T20:00+0000 to unblock that formal announcement. ==00:00:01:11.756 3464664== VALGRINDERROR-BEGIN ==00:00:01:11.756 3464664== Unaddressable byte(s) found during client check request ==00:00:01:11.769 3464664== at 0xC6076A: pg_mblen_with_len (mbutils.c:1115) ==00:00:01:11.769 3464664== by 0xC07CD3: pg_mbcharcliplen_chars (varlena.c:807) ==00:00:01:11.770 3464664== by 0xC07AAD: text_substring (varlena.c:732) ==00:00:01:11.770 3464664== by 0xC07797: text_substr (varlena.c:553) ==00:00:01:11.770 3464664== by 0x779688: ExecInterpExpr (execExprInterp.c:953) ==00:00:01:11.770 3464664== by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299) ==00:00:01:11.770 3464664== by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423) ==00:00:01:11.770 3464664== by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464) ==00:00:01:11.770 3464664== by 0x7DBCD3: ExecProject (executor.h:496) ==00:00:01:11.770 3464664== by 0x7DC134: ExecScanExtended (execScan.h:234) ==00:00:01:11.770 3464664== by 0x7DC474: ExecSeqScanWithProject (nodeSeqscan.c:162) ==00:00:01:11.770 3464664== by 0x794561: ExecProcNodeFirst (execProcnode.c:469) ==00:00:01:11.770 3464664== Address 0x19340a5e is 12,062 bytes inside a block of size 12,064 alloc'd ==00:00:01:11.770 3464664== at 0x4C29F73: malloc (vg_replace_malloc.c:309) ==00:00:01:11.770 3464664== by 0xC79F03: AllocSetAllocLarge (aset.c:756) ==00:00:01:11.770 3464664== by 0xC7AA4C: AllocSetAlloc (aset.c:1033) ==00:00:01:11.770 3464664== by 0xC8B37E: palloc (mcxt.c:1408) ==00:00:01:11.770 3464664== by 0x4B0182: detoast_attr_slice (detoast.c:324) ==00:00:01:11.770 3464664== by 0xC5177F: pg_detoast_datum_slice (fmgr.c:1825) ==00:00:01:11.770 3464664== by 0xC07A11: text_substring (varlena.c:716) ==00:00:01:11.770 3464664== by 0xC07797: text_substr (varlena.c:553) ==00:00:01:11.770 3464664== by 0x779688: ExecInterpExpr (execExprInterp.c:953) ==00:00:01:11.770 3464664== by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299) ==00:00:01:11.770 3464664== by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423) ==00:00:01:11.770 3464664== by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464) ==00:00:01:11.770 3464664== ==00:00:01:11.770 3464664== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:User fun:pg_mblen_with_len fun:pg_mbcharcliplen_chars fun:text_substring fun:text_substr fun:ExecInterpExpr fun:ExecInterpExprStillValid fun:ExecEvalExprNoReturn fun:ExecEvalExprNoReturnSwitchContext fun:ExecProject fun:ExecScanExtended fun:ExecSeqScanWithProject fun:ExecProcNodeFirst } 2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding STATEMENT: SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; **00:00:01:11.771 3464664** Valgrind detected 1 error(s) during execution of "SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;" From: Noah Misch <[email protected]> Fix SUBSTRING() for toasted multibyte characters. Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an incomplete character. Most callers want that. text_substring() does not. It detoasts the most bytes it could possibly need to get the requested number of characters. For example, to extract up to 2 chars from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8 chars, 8 bytes spans 2 complete chars and 1 partial char. Fix this by replacing this pg_mbstrlen_with_len() call with a string traversal that differs by stopping upon finding as many chars as the substring could need. This also makes SUBSTRING() stop raising an encoding error if the incomplete char is past the end of the substring. This is consistent with the general philosophy of the above commit, which was to raise errors on a just-in-time basis. Before the above commit, SUBSTRING() never raised an encoding error. SUBSTRING() has long been detoasting enough for one more char than needed, because it did not distinguish exclusive and inclusive end position. For avoidance of doubt, stop detoasting extra. Back-patch to v14, like the above commit. For applications using SUBSTRING() on non-ASCII column values, consider applying this to your copy of any of the February 12, 2026 releases. Reported-by: SATŌ Kentarō <[email protected]> Reviewed-by: FIXME Bug: #19406 Discussion: https://postgr.es/m/[email protected] Backpatch-through: 14 diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index dbecd71..99a5470 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -133,6 +133,7 @@ static text *text_substring(Datum str, int32 start, int32 length, bool length_not_specified); +static int pg_mbcharcliplen_chars(const char *mbstr, int len, int limit); static text *text_overlay(text *t1, text *t2, int sp, int sl); static int text_position(text *t1, text *t2, Oid collid); static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state); @@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) int32 S = start; /* start position */ int32 S1; /* adjusted start position */ int32 L1; /* adjusted substring length */ - int32 E; /* end position */ + int32 E; /* end position, exclusive */ /* * SQL99 says S can be zero or negative (which we don't document), but we @@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) else { /* - * A zero or negative value for the end position can happen if the - * start was negative or one. SQL99 says to return a zero-length - * string. + * Ending at position 1, exclusive, obviously yields an empty + * string. A zero or negative value can happen if the start was + * negative or one. SQL99 says to return a zero-length string. */ - if (E < 1) + if (E <= 1) return cstring_to_text(""); /* @@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) L1 = E - S1; /* - * Total slice size in bytes can't be any longer than the start - * position plus substring length times the encoding max length. - * If that overflows, we can just use -1. + * Total slice size in bytes can't be any longer than the + * inclusive end position times the encoding max length. If that + * overflows, we can just use -1. */ - if (pg_mul_s32_overflow(E, eml, &slice_size)) + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) slice_size = -1; } @@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* Now we can get the actual length of the slice in MB characters */ - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), - slice_len); + slice_strlen = + (slice_size == -1 ? + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) : + pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1)); /* * Check that the start position wasn't > slice_strlen. If so, SQL99 @@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* + * pg_mbcharcliplen_chars - + * Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes. + * + * This mirrors all the dubious historical behavior, so it's static to + * discourage proliferation. The assertions are specific to the one caller. + */ +static int +pg_mbcharcliplen_chars(const char *mbstr, int len, int limit) +{ + int nch = 0; + int l; + + Assert(len > 0); + Assert(limit > 0); + Assert(pg_database_encoding_max_length() > 1); + + while (len > 0 && *mbstr) + { + l = pg_mblen_with_len(mbstr, len); + nch++; + if (nch == limit) + break; + len -= l; + mbstr += l; + } + return nch; +} + +/* * textoverlay * Replace specified substring of first string with second * diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out index ea1f38c..cac1cb7 100644 --- a/src/test/regress/expected/encoding.out +++ b/src/test/regress/expected/encoding.out @@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; + substring +----------- + caf +(1 row) + +SELECT substring(truncated, 1, 4) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 SELECT reverse(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 @@ -375,7 +381,43 @@ NOTICE: MULE_INTERNAL LC2: \x908283 -> {9470595} -> \x908283 = OK t (1 row) +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; + substring +----------- + +(1 row) + +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + substring +----------- + 🚀 +(1 row) + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql index b9543c0..782f90f 100644 --- a/src/test/regress/sql/encoding.sql +++ b/src/test/regress/sql/encoding.sql @@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; +SELECT substring(truncated, 1, 4) FROM regress_encoding; SELECT reverse(truncated) FROM regress_encoding; -- invalid short mb character = silently dropped SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding; @@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES SELECT COUNT(test_encoding(encoding, description, input)) > 0 FROM encoding_tests; +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; From: Noah Misch <[email protected]> pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport. The prior order caused spurious Valgrind errors. They're spurious because the ereport(ERROR) stops subsequent code from accessing the memory in question. pg_mblen_cstr() ordered the checks correctly, but these other two did not. Back-patch to v14, like commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728. Reviewed-by: FIXME Discussion: https://postgr.es/m/FIXME Backpatch-through: 14 diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index a5a7348..f3f94d4 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -1086,15 +1086,16 @@ pg_mblen_range(const char *mbstr, const char *end) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(end > mbstr); + + if (unlikely(mbstr + length > end)) + report_invalid_encoding_db(mbstr, length, end - mbstr); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, end - mbstr); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(mbstr + length > end)) - report_invalid_encoding_db(mbstr, length, end - mbstr); - return length; } @@ -1109,15 +1110,16 @@ pg_mblen_with_len(const char *mbstr, int limit) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(limit >= 1); + + if (unlikely(length > limit)) + report_invalid_encoding_db(mbstr, length, limit); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, limit); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(length > limit)) - report_invalid_encoding_db(mbstr, length, limit); - return length; } Attachments: [text/plain] toast-slice-mblen-v3.patch (8.7K, 2-toast-slice-mblen-v3.patch) download | inline diff: From: Noah Misch <[email protected]> Fix SUBSTRING() for toasted multibyte characters. Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an incomplete character. Most callers want that. text_substring() does not. It detoasts the most bytes it could possibly need to get the requested number of characters. For example, to extract up to 2 chars from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8 chars, 8 bytes spans 2 complete chars and 1 partial char. Fix this by replacing this pg_mbstrlen_with_len() call with a string traversal that differs by stopping upon finding as many chars as the substring could need. This also makes SUBSTRING() stop raising an encoding error if the incomplete char is past the end of the substring. This is consistent with the general philosophy of the above commit, which was to raise errors on a just-in-time basis. Before the above commit, SUBSTRING() never raised an encoding error. SUBSTRING() has long been detoasting enough for one more char than needed, because it did not distinguish exclusive and inclusive end position. For avoidance of doubt, stop detoasting extra. Back-patch to v14, like the above commit. For applications using SUBSTRING() on non-ASCII column values, consider applying this to your copy of any of the February 12, 2026 releases. Reported-by: SATŌ Kentarō <[email protected]> Reviewed-by: FIXME Bug: #19406 Discussion: https://postgr.es/m/[email protected] Backpatch-through: 14 diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index dbecd71..99a5470 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -133,6 +133,7 @@ static text *text_substring(Datum str, int32 start, int32 length, bool length_not_specified); +static int pg_mbcharcliplen_chars(const char *mbstr, int len, int limit); static text *text_overlay(text *t1, text *t2, int sp, int sl); static int text_position(text *t1, text *t2, Oid collid); static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state); @@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) int32 S = start; /* start position */ int32 S1; /* adjusted start position */ int32 L1; /* adjusted substring length */ - int32 E; /* end position */ + int32 E; /* end position, exclusive */ /* * SQL99 says S can be zero or negative (which we don't document), but we @@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) else { /* - * A zero or negative value for the end position can happen if the - * start was negative or one. SQL99 says to return a zero-length - * string. + * Ending at position 1, exclusive, obviously yields an empty + * string. A zero or negative value can happen if the start was + * negative or one. SQL99 says to return a zero-length string. */ - if (E < 1) + if (E <= 1) return cstring_to_text(""); /* @@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) L1 = E - S1; /* - * Total slice size in bytes can't be any longer than the start - * position plus substring length times the encoding max length. - * If that overflows, we can just use -1. + * Total slice size in bytes can't be any longer than the + * inclusive end position times the encoding max length. If that + * overflows, we can just use -1. */ - if (pg_mul_s32_overflow(E, eml, &slice_size)) + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) slice_size = -1; } @@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* Now we can get the actual length of the slice in MB characters */ - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), - slice_len); + slice_strlen = + (slice_size == -1 ? + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) : + pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1)); /* * Check that the start position wasn't > slice_strlen. If so, SQL99 @@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* + * pg_mbcharcliplen_chars - + * Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes. + * + * This mirrors all the dubious historical behavior, so it's static to + * discourage proliferation. The assertions are specific to the one caller. + */ +static int +pg_mbcharcliplen_chars(const char *mbstr, int len, int limit) +{ + int nch = 0; + int l; + + Assert(len > 0); + Assert(limit > 0); + Assert(pg_database_encoding_max_length() > 1); + + while (len > 0 && *mbstr) + { + l = pg_mblen_with_len(mbstr, len); + nch++; + if (nch == limit) + break; + len -= l; + mbstr += l; + } + return nch; +} + +/* * textoverlay * Replace specified substring of first string with second * diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out index ea1f38c..cac1cb7 100644 --- a/src/test/regress/expected/encoding.out +++ b/src/test/regress/expected/encoding.out @@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; + substring +----------- + caf +(1 row) + +SELECT substring(truncated, 1, 4) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 SELECT reverse(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 @@ -375,7 +381,43 @@ NOTICE: MULE_INTERNAL LC2: \x908283 -> {9470595} -> \x908283 = OK t (1 row) +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; + substring +----------- + +(1 row) + +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + substring +----------- + 🚀 +(1 row) + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql index b9543c0..782f90f 100644 --- a/src/test/regress/sql/encoding.sql +++ b/src/test/regress/sql/encoding.sql @@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; +SELECT substring(truncated, 1, 4) FROM regress_encoding; SELECT reverse(truncated) FROM regress_encoding; -- invalid short mb character = silently dropped SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding; @@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES SELECT COUNT(test_encoding(encoding, description, input)) > 0 FROM encoding_tests; +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; [text/plain] mblen-valgrind-after-report-v1.patch (1.7K, 3-mblen-valgrind-after-report-v1.patch) download | inline diff: From: Noah Misch <[email protected]> pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport. The prior order caused spurious Valgrind errors. They're spurious because the ereport(ERROR) stops subsequent code from accessing the memory in question. pg_mblen_cstr() ordered the checks correctly, but these other two did not. Back-patch to v14, like commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728. Reviewed-by: FIXME Discussion: https://postgr.es/m/FIXME Backpatch-through: 14 diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index a5a7348..f3f94d4 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -1086,15 +1086,16 @@ pg_mblen_range(const char *mbstr, const char *end) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(end > mbstr); + + if (unlikely(mbstr + length > end)) + report_invalid_encoding_db(mbstr, length, end - mbstr); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, end - mbstr); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(mbstr + length > end)) - report_invalid_encoding_db(mbstr, length, end - mbstr); - return length; } @@ -1109,15 +1110,16 @@ pg_mblen_with_len(const char *mbstr, int limit) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(limit >= 1); + + if (unlikely(length > limit)) + report_invalid_encoding_db(mbstr, length, limit); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, limit); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(length > limit)) - report_invalid_encoding_db(mbstr, length, limit); - return length; } ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> @ 2026-02-14 08:15 ` Thomas Munro <[email protected]> 2026-02-14 19:07 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Thomas Munro @ 2026-02-14 08:15 UTC (permalink / raw) To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected] On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote: > [mblen-valgrind-after-report-v1.patch] LGTM. The new valgrind check should clearly be after the new non-local exit. Studying the other patch... ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 08:15 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> @ 2026-02-14 19:07 ` Thomas Munro <[email protected]> 2026-02-14 19:33 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Thomas Munro @ 2026-02-14 19:07 UTC (permalink / raw) To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected] On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <[email protected]> wrote: > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote: > > [mblen-valgrind-after-report-v1.patch] > > LGTM. The new valgrind check should clearly be after the new non-local exit. > > Studying the other patch... /* - * Total slice size in bytes can't be any longer than the start - * position plus substring length times the encoding max length. - * If that overflows, we can just use -1. + * Total slice size in bytes can't be any longer than the + * inclusive end position times the encoding max length. If that + * overflows, we can just use -1. */ - if (pg_mul_s32_overflow(E, eml, &slice_size)) + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) slice_size = -1; Isn't it still conceptually "exclusive", but adjusted to be zero-indexed? /* Now we can get the actual length of the slice in MB characters */ - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), - slice_len); + slice_strlen = + (slice_size == -1 ? + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) : + pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1)); Comment presumably needs adjustment to say that we only count as far as we need to, and why. There is something a bit strange about all this, though. pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters that really exist past 2^29 (~500 million), you must get an empty string, right? That's hard to reach, pre-existing and out of scope for the immediate problem report, except ... now we're contorting the code even further to keep it. The outline I had come up with before seeing your patch was: let's just delete it. The position search can check bounds incrementally, following our general approach. This avoids the reported problem by ditching the pre-flight scan through the slice (up to 4x more pg_mblen_XXX calls and memory access than we strictly need), and also the special cases for empty strings since they already fall out of the general behaviour (am I missing something?), not leaving much code behind. As far as I can see so far, the only user-visible side-effect requires corruption: substring() moves from the internal-NUL-is-terminator category to internal-NUL-is-character category, but that's an implementation detail. When I saw your patch yesterday, I initially abandoned the thought, thinking that your idea looked more conservative, but after sleeping on it and reflecting again on these oddities, I have merged my draft implementation with your tests, ancient detoasting fence post observation and commit message, just to see if you think this approach might be worth considering further. Attachments: [application/x-patch] v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch (13.0K, 2-v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch) download | inline diff: From 9f03e4d23356e8012e8a190bddd81059dc88471b Mon Sep 17 00:00:00 2001 From: Noah Misch <[email protected]> Date: Sat, 14 Feb 2026 21:07:38 +1300 Subject: [PATCH v2tm 1/2] Fix SUBSTRING() for toasted multibyte characters. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an incomplete character. Most callers want that. text_substring() did not. It detoasts the most bytes it could possibly need to get the requested number of characters. For example, to extract up to 2 chars from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8 chars, 8 bytes spans 2 complete chars and 1 partial char. Fix by removing the pg_mbstrlen_with_len() call. An encoding error is still raised if input runs out in the middle of a multi-byte sequence that is needed for the result, but that can be detected incrementally. This is consistent with the general philosophy of the above commit, which was to raise errors on a just-in-time basis. Before the above commit, SUBSTRING() never raised an encoding error. SUBSTRING() has long been detoasting enough for one more char than needed, because it did not distinguish exclusive and inclusive end position. For avoidance of doubt, stop detoasting extra. A long standing overflow bug affecting positions exceeding 2^29 is also addressed, though such large numbers would be unlikely to be used in real applications. Back-patch to v14, like the above commit. For applications using SUBSTRING() on non-ASCII column values, consider applying this to your copy of any of the February 12, 2026 releases. Reported-by: SATŌ Kentarō <[email protected]> Reviewed-by: FIXME Bug: #19406 Discussion: https://postgr.es/m/[email protected] Backpatch-through: 14 --- src/backend/utils/adt/varlena.c | 103 +++++++------------------ src/test/regress/expected/encoding.out | 72 ++++++++++++----- src/test/regress/sql/encoding.sql | 26 +++++-- 3 files changed, 102 insertions(+), 99 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index dbecd7160d6..5e3951ab0e1 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -586,7 +586,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) int32 S = start; /* start position */ int32 S1; /* adjusted start position */ int32 L1; /* adjusted substring length */ - int32 E; /* end position */ + int32 E; /* end position, exclusive */ /* * SQL99 says S can be zero or negative (which we don't document), but we @@ -644,34 +644,27 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) * detoasting, so we'll grab a conservatively large slice now and go * back later to do the right thing */ - int32 slice_start; int32 slice_size; - int32 slice_strlen; - int32 slice_len; + const char *slice_end; text *slice; - int32 E1; int32 i; char *p; char *s; text *ret; - /* - * We need to start at position zero because there is no way to know - * in advance which byte offset corresponds to the supplied start - * position. - */ - slice_start = 0; - - if (length_not_specified) /* special case - get length to end of - * string */ - slice_size = L1 = -1; + if (length_not_specified) /* special case - get whole string */ + { + slice_size = -1; + E = PG_INT32_MAX; + } else if (length < 0) { /* SQL99 says to throw an error for E < S, i.e., negative length */ ereport(ERROR, (errcode(ERRCODE_SUBSTRING_ERROR), errmsg("negative substring length not allowed"))); - slice_size = L1 = -1; /* silence stupider compilers */ + slice_size = -1; /* silence stupider compilers */ + E = PG_INT32_MAX; } else if (pg_add_s32_overflow(S, length, &E)) { @@ -679,30 +672,17 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) * L could be large enough for S + L to overflow, in which case * the substring must run to end of string. */ - slice_size = L1 = -1; + slice_size = -1; + E = PG_INT32_MAX; } else { /* - * A zero or negative value for the end position can happen if the - * start was negative or one. SQL99 says to return a zero-length - * string. - */ - if (E < 1) - return cstring_to_text(""); - - /* - * if E is past the end of the string, the tuple toaster will - * truncate the length for us + * Total slice size in bytes can't be any longer than the end + * position (made zero-based) times the encoding max length. If + * that overflows, we can just use -1. */ - L1 = E - S1; - - /* - * Total slice size in bytes can't be any longer than the start - * position plus substring length times the encoding max length. - * If that overflows, we can just use -1. - */ - if (pg_mul_s32_overflow(E, eml, &slice_size)) + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) slice_size = -1; } @@ -712,63 +692,34 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) */ if (VARATT_IS_COMPRESSED(DatumGetPointer(str)) || VARATT_IS_EXTERNAL(DatumGetPointer(str))) - slice = DatumGetTextPSlice(str, slice_start, slice_size); + slice = DatumGetTextPSlice(str, 0, slice_size); else slice = (text *) DatumGetPointer(str); - /* see if we got back an empty string */ - slice_len = VARSIZE_ANY_EXHDR(slice); - if (slice_len == 0) - { - if (slice != (text *) DatumGetPointer(str)) - pfree(slice); - return cstring_to_text(""); - } - - /* Now we can get the actual length of the slice in MB characters */ - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), - slice_len); - - /* - * Check that the start position wasn't > slice_strlen. If so, SQL99 - * says to return a zero-length string. - */ - if (S1 > slice_strlen) - { - if (slice != (text *) DatumGetPointer(str)) - pfree(slice); - return cstring_to_text(""); - } - - /* - * Adjust L1 and E1 now that we know the slice string length. Again - * remember that S1 is one based, and slice_start is zero based. - */ - if (L1 > -1) - E1 = Min(S1 + L1, slice_start + 1 + slice_strlen); - else - E1 = slice_start + 1 + slice_strlen; + p = VARDATA_ANY(slice); + slice_end = p + VARSIZE_ANY_EXHDR(slice); /* - * Find the start position in the slice; remember S1 is not zero based + * Find the start position in the slice; remember S1 is not zero + * based. If we run out of input first, we'll return an empty string. */ - p = VARDATA_ANY(slice); - for (i = 0; i < S1 - 1; i++) - p += pg_mblen_unbounded(p); + for (i = 1; i < S1 && p < slice_end; i++) + p += pg_mblen_range(p, slice_end); /* hang onto a pointer to our start position */ s = p; /* * Count the actual bytes used by the substring of the requested - * length. + * length, again giving up if we run out of input. */ - for (i = S1; i < E1; i++) - p += pg_mblen_unbounded(p); + for (i = S1; i < E && p < slice_end; i++) + p += pg_mblen_range(p, slice_end); ret = (text *) palloc(VARHDRSZ + (p - s)); SET_VARSIZE(ret, VARHDRSZ + (p - s)); - memcpy(VARDATA(ret), s, (p - s)); + if (s < p) + memcpy(VARDATA(ret), s, (p - s)); if (slice != (text *) DatumGetPointer(str)) pfree(slice); diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out index ea1f38cff41..eec879b05ec 100644 --- a/src/test/regress/expected/encoding.out +++ b/src/test/regress/expected/encoding.out @@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; + substring +----------- + caf +(1 row) + +SELECT substring(truncated, 1, 4) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 SELECT reverse(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 @@ -84,6 +90,13 @@ SELECT length(with_nul) FROM regress_encoding; 4 (1 row) +SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding; + regexp_replace +---------------- + é +(1 row) + +-- NUL = character SELECT substring(with_nul, 3, 1) FROM regress_encoding; substring ----------- @@ -96,25 +109,12 @@ SELECT substring(with_nul, 4, 1) FROM regress_encoding; é (1 row) -SELECT substring(with_nul, 5, 1) FROM regress_encoding; - substring ------------ - -(1 row) - -SELECT convert_to(substring(with_nul, 5, 1), 'UTF8') FROM regress_encoding; - convert_to ------------- - \x -(1 row) - -SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding; - regexp_replace ----------------- - é +SELECT substring(with_nul, 5, 1) = test_bytea_to_text('\x00') FROM regress_encoding; + ?column? +---------- + t (1 row) --- NUL = character SELECT with_nul, reverse(with_nul), reverse(reverse(with_nul)) FROM regress_encoding; with_nul | reverse | reverse ----------+---------+--------- @@ -375,7 +375,43 @@ NOTICE: MULE_INTERNAL LC2: \x908283 -> {9470595} -> \x908283 = OK t (1 row) +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; + substring +----------- + +(1 row) + +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + substring +----------- + 🚀 +(1 row) + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql index b9543c0cb32..7de95f49947 100644 --- a/src/test/regress/sql/encoding.sql +++ b/src/test/regress/sql/encoding.sql @@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; +SELECT substring(truncated, 1, 4) FROM regress_encoding; SELECT reverse(truncated) FROM regress_encoding; -- invalid short mb character = silently dropped SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding; @@ -51,12 +52,11 @@ SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding; -- NUL = terminator SELECT length(with_nul) FROM regress_encoding; -SELECT substring(with_nul, 3, 1) FROM regress_encoding; -SELECT substring(with_nul, 4, 1) FROM regress_encoding; -SELECT substring(with_nul, 5, 1) FROM regress_encoding; -SELECT convert_to(substring(with_nul, 5, 1), 'UTF8') FROM regress_encoding; SELECT regexp_replace(with_nul, '^caf(.)$', '\1') FROM regress_encoding; -- NUL = character +SELECT substring(with_nul, 3, 1) FROM regress_encoding; +SELECT substring(with_nul, 4, 1) FROM regress_encoding; +SELECT substring(with_nul, 5, 1) = test_bytea_to_text('\x00') FROM regress_encoding; SELECT with_nul, reverse(with_nul), reverse(reverse(with_nul)) FROM regress_encoding; -- If a corrupted string contains NUL in the tail bytes of a multibyte @@ -212,7 +212,23 @@ INSERT INTO encoding_tests VALUES SELECT COUNT(test_encoding(encoding, description, input)) > 0 FROM encoding_tests; +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; -- 2.47.3 ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 08:15 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 2026-02-14 19:07 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> @ 2026-02-14 19:33 ` Noah Misch <[email protected]> 2026-02-14 20:10 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Noah Misch @ 2026-02-14 19:33 UTC (permalink / raw) To: Thomas Munro <[email protected]>; +Cc: [email protected]; [email protected] On Sun, Feb 15, 2026 at 08:07:22AM +1300, Thomas Munro wrote: > On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <[email protected]> wrote: > > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote: > > > [mblen-valgrind-after-report-v1.patch] > > > > LGTM. The new valgrind check should clearly be after the new non-local exit. > > > > Studying the other patch... Thank you. > /* > - * Total slice size in bytes can't be any > longer than the start > - * position plus substring length times the > encoding max length. > - * If that overflows, we can just use -1. > + * Total slice size in bytes can't be any > longer than the > + * inclusive end position times the encoding > max length. If that > + * overflows, we can just use -1. > */ > - if (pg_mul_s32_overflow(E, eml, &slice_size)) > + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) > slice_size = -1; > > Isn't it still conceptually "exclusive", but adjusted to be zero-indexed? Since it's discrete (being an integer), "up to E, exclusive" and "up to E - 1, inclusive" are the same thing. My comment may not be the optimal way to express that. > /* Now we can get the actual length of the slice in MB > characters */ > - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), > - > slice_len); > + slice_strlen = > + (slice_size == -1 ? > + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) : > + pg_mbcharcliplen_chars(VARDATA_ANY(slice), > slice_len, E - 1)); > > Comment presumably needs adjustment to say that we only count as far > as we need to, and why. Changed to: /* * Now we can get the actual length of the slice in MB characters, * stopping at the end of the substring. Continuing beyond the * substring end could find an incomplete character attributable * solely to DatumGetTextPSlice() chopping in the middle of a * character, and it would be superfluous work at best. */ > There is something a bit strange about all this, though. > pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters > that really exist past 2^29 (~500 million), you must get an empty > string, right? That's hard to reach, pre-existing and out of scope > for the immediate problem report, except ... now we're contorting the > code even further to keep it. - slice_size is the amount we *requested* from the toaster. It can be -1, which retrieves the max available. - slice_len is the amount *returned* from the toaster. It's nonnegative. Does a behavior specific to strings >2^29 still exist? > The outline I had come up with before seeing your patch was: let's > just delete it. The position search can check bounds incrementally, > following our general approach. This avoids the reported problem by > ditching the pre-flight scan through the slice (up to 4x more > pg_mblen_XXX calls and memory access than we strictly need), and also > the special cases for empty strings since they already fall out of the > general behaviour (am I missing something?), not leaving much code > behind. Like you, I made a note that it's wasteful to make two mblen passes over the string. I'm only seeing a 50% reduction in mblen calls, not an 80% reduction, but I didn't look too closely. I guessed such a change would be less clearly correct, so I figured it would be less suitable for back branches. Hence, I didn't draft it. > As far as I can see so far, the only user-visible side-effect > requires corruption: substring() moves from the > internal-NUL-is-terminator category to internal-NUL-is-character > category, but that's an implementation detail. That does carry some risk, not necessarily too much to accept. > When I saw your patch yesterday, I initially abandoned the thought, > thinking that your idea looked more conservative, but after sleeping > on it and reflecting again on these oddities, I have merged my draft > implementation with your tests, ancient detoasting fence post > observation and commit message, just to see if you think this approach > might be worth considering further. My first impression, hurried due to the commit ETA in 30 minutes, is that this is less conservative and should hold for master-only. ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 08:15 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 2026-02-14 19:07 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 2026-02-14 19:33 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> @ 2026-02-14 20:10 ` Thomas Munro <[email protected]> 2026-02-14 20:19 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Thomas Munro @ 2026-02-14 20:10 UTC (permalink / raw) To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected] On Sun, Feb 15, 2026 at 8:33 AM Noah Misch <[email protected]> wrote: > - slice_len is the amount *returned* from the toaster. It's nonnegative. Ah, right, that makes more sense. > > The outline I had come up with before seeing your patch was: let's > > just delete it. The position search can check bounds incrementally, > > following our general approach. This avoids the reported problem by > > ditching the pre-flight scan through the slice (up to 4x more > > pg_mblen_XXX calls and memory access than we strictly need), and also > > the special cases for empty strings since they already fall out of the > > general behaviour (am I missing something?), not leaving much code > > behind. > > Like you, I made a note that it's wasteful to make two mblen passes over the > string. I'm only seeing a 50% reduction in mblen calls, not an 80% reduction, > but I didn't look too closely. I guessed such a change would be less clearly > correct, so I figured it would be less suitable for back branches. Hence, I > didn't draft it. I was comparing to unpatched master, but yeah of course your patch already gets part of the way there. > My first impression, hurried due to the commit ETA in 30 minutes, is that this > is less conservative and should hold for master-only. Got it. Will add it to the pile of master-only fallout from this area. ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 08:15 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 2026-02-14 19:07 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> 2026-02-14 19:33 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 20:10 ` Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Thomas Munro <[email protected]> @ 2026-02-14 20:19 ` Thomas Munro <[email protected]> 0 siblings, 0 replies; 6+ messages in thread From: Thomas Munro @ 2026-02-14 20:19 UTC (permalink / raw) To: Noah Misch <[email protected]>; +Cc: [email protected]; [email protected] Also, LGTM. ^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2026-02-14 20:19 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-02-14 05:38 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Noah Misch <[email protected]> 2026-02-14 08:15 ` Thomas Munro <[email protected]> 2026-02-14 19:07 ` Thomas Munro <[email protected]> 2026-02-14 19:33 ` Noah Misch <[email protected]> 2026-02-14 20:10 ` Thomas Munro <[email protected]> 2026-02-14 20:19 ` Thomas Munro <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox