Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vr8MU-0075On-0y for pgsql-bugs@arkaria.postgresql.org; Sat, 14 Feb 2026 05:38:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vr8MR-00H4rg-05 for pgsql-bugs@arkaria.postgresql.org; Sat, 14 Feb 2026 05:38:31 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vr8MQ-00H4rY-1N for pgsql-bugs@lists.postgresql.org; Sat, 14 Feb 2026 05:38:30 +0000 Received: from mail-dl1-x122c.google.com ([2607:f8b0:4864:20::122c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vr8MN-00000000bMY-2ENl for pgsql-bugs@lists.postgresql.org; Sat, 14 Feb 2026 05:38:29 +0000 Received: by mail-dl1-x122c.google.com with SMTP id a92af1059eb24-12732165d1eso1978802c88.1 for ; Fri, 13 Feb 2026 21:38:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=leadboat.com; s=google; t=1771047504; x=1771652304; darn=lists.postgresql.org; h=user-agent:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=k+Ag97JTpZOSUxbKbK5lheo7wikVPXwz/QmsLwW7Qkc=; b=Bh9aGjbHPyZ//b+JfVRe2lflCUNKB7Y15zxW66BSVKNZk42E2vQIPYP6vxKus6oCqe 1wqmDQXWS4wUdmXD/bHUJHSaEPrNNEV56Mg4Ud5PCTn7lXsK8jzgJf9ByberrEcm+uWl taiLe/zJFn6vX1Cy12iuvap8IuWUm2PWJz/EI= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1771047504; x=1771652304; h=user-agent:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=k+Ag97JTpZOSUxbKbK5lheo7wikVPXwz/QmsLwW7Qkc=; b=bM8XxaiAeai7aKNhcjjOx78mjDVWeZkDrWIe/22NFJMNEwfUVAYklHM0/UNX8pBCbm 3Vm42v9U2AZxUDCbDKeEUVHE2hzzWDbOPtEl1QRpQ4L6zCmdOzZmF6/jlEvAz+ZRMGPR vIwKJk+ZRVxrLg0Q64hgxLA3QFS6z3lZblhzdUpoopYu2QPzxwDEKhpII3YWpr8wvCEO TIKakMAEs74sbmubXedLdvcNp1LbfDwLB8ZP4Dov7TTBRRx2zlEKNIbbnRQ7isjlx4OD NkZdsFPMyJYm10RKxgm1L/aGXtprd5hFkIpoeThtRVVnYk8RnrRiwvgaibLFvOfya/2L RMog== X-Forwarded-Encrypted: i=1; AJvYcCW5lwszmLfI0Mvr9jlfUxgN1ekKw3zI91VT8oMEDnsoRV4Kr2lpFc41/CTzbmcZn9fSZ7m8qhAlzWev@lists.postgresql.org X-Gm-Message-State: AOJu0Yz6Kbknq7QA8seen1WAmZ61/fGjqEswfavlmN3UGhMSl1hwDp1k AXeNAYwUW8z3qv10Qj4EhIQF5lJYHgiYiOD+tnsOA77chp0nVBfUCuLBrPgQ3iKMFQ== X-Gm-Gg: AZuq6aJ2KRR+QulvpzaXTohb2KmCa62s6eeZxrqvJV5wFVWElK8a/dAG99fI6jFtIeS g8S+6JQCMQudiPhst6CPIb2c+YETB+ius++TISbBZxUgtYB8I2USpkjUXnxo9gOiAD+1xLcCaXH w2BNMyrg3qJGX9aHo95Dou9fRxQVzfRRZAN+iS36QC7BgLjeJ6EJKegF6A/tb3LMMxed0A2eEfs 5mJqt70AXQ3lKQdVMo40A5W9bawKjVmJiurjTv2OtvX9u64k9Fw1BuCnRdqXsuGZYe4aW5WtFs9 WE9iOSmiR6V9hZITsvplXNsbDMRrXxfvQxlsvzB21Yu6McimQ+Pc2udsr9Xg4HaEN/SQ94R9fsF mr+04RVGtOmnL/+m4HlcrVhr4Kaga33H68JQsva+x9s240kzTTw1vONq6xNvJUfE4mU2630TXR7 H9joUYcjhB+YBR/uCiTuFw8Y1ULOSBcB19ULhjvB7Zq7tz2Bc6szB2O8RPynHARFu9FQ== X-Received: by 2002:a05:7300:80d6:b0:2ba:990a:4829 with SMTP id 5a478bee46e88-2babc556720mr1492030eec.30.1771047503915; Fri, 13 Feb 2026 21:38:23 -0800 (PST) Received: from microsoft.com (c-73-15-160-255.hsd1.ca.comcast.net. [73.15.160.255]) by smtp.gmail.com with ESMTPSA id a92af1059eb24-12742cadb25sm1260709c88.10.2026.02.13.21.38.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 13 Feb 2026 21:38:23 -0800 (PST) Date: Fri, 13 Feb 2026 21:38:21 -0800 From: Noah Misch To: ranvis@gmail.com, pgsql-bugs@lists.postgresql.org Cc: thomas.munro@gmail.com Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 Message-ID: <20260214053821.fa.noahmisch@microsoft.com> References: <19406-9867fddddd724fca@postgresql.org> <20260213172702.71@rfd.leadboat.com> <20260213224804.2c@rfd.leadboat.com> <20260214002113.1f.noahmisch@microsoft.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="opQlKuIvXApRYIw/" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20260214002113.1f.noahmisch@microsoft.com> User-Agent: Mutt/2.2.12 (2023-09-09) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --opQlKuIvXApRYIw/ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Feb 13, 2026 at 04:21:13PM -0800, Noah Misch wrote: > Review welcome. I have a Valgrind test run ongoing. Valgrind found the complaint below, but I think this an instrumentation problem. I've added a fix for that instrumentation. I also made minor edits to the log message of the main patch, hence v3. The release team is preparing to announce a 2026-02-26 out-of-cycle release in light of this regression. I plan to push these fixes at 2026-02-14T20:00+0000 to unblock that formal announcement. ==00:00:01:11.756 3464664== VALGRINDERROR-BEGIN ==00:00:01:11.756 3464664== Unaddressable byte(s) found during client check request ==00:00:01:11.769 3464664== at 0xC6076A: pg_mblen_with_len (mbutils.c:1115) ==00:00:01:11.769 3464664== by 0xC07CD3: pg_mbcharcliplen_chars (varlena.c:807) ==00:00:01:11.770 3464664== by 0xC07AAD: text_substring (varlena.c:732) ==00:00:01:11.770 3464664== by 0xC07797: text_substr (varlena.c:553) ==00:00:01:11.770 3464664== by 0x779688: ExecInterpExpr (execExprInterp.c:953) ==00:00:01:11.770 3464664== by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299) ==00:00:01:11.770 3464664== by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423) ==00:00:01:11.770 3464664== by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464) ==00:00:01:11.770 3464664== by 0x7DBCD3: ExecProject (executor.h:496) ==00:00:01:11.770 3464664== by 0x7DC134: ExecScanExtended (execScan.h:234) ==00:00:01:11.770 3464664== by 0x7DC474: ExecSeqScanWithProject (nodeSeqscan.c:162) ==00:00:01:11.770 3464664== by 0x794561: ExecProcNodeFirst (execProcnode.c:469) ==00:00:01:11.770 3464664== Address 0x19340a5e is 12,062 bytes inside a block of size 12,064 alloc'd ==00:00:01:11.770 3464664== at 0x4C29F73: malloc (vg_replace_malloc.c:309) ==00:00:01:11.770 3464664== by 0xC79F03: AllocSetAllocLarge (aset.c:756) ==00:00:01:11.770 3464664== by 0xC7AA4C: AllocSetAlloc (aset.c:1033) ==00:00:01:11.770 3464664== by 0xC8B37E: palloc (mcxt.c:1408) ==00:00:01:11.770 3464664== by 0x4B0182: detoast_attr_slice (detoast.c:324) ==00:00:01:11.770 3464664== by 0xC5177F: pg_detoast_datum_slice (fmgr.c:1825) ==00:00:01:11.770 3464664== by 0xC07A11: text_substring (varlena.c:716) ==00:00:01:11.770 3464664== by 0xC07797: text_substr (varlena.c:553) ==00:00:01:11.770 3464664== by 0x779688: ExecInterpExpr (execExprInterp.c:953) ==00:00:01:11.770 3464664== by 0x77BBD6: ExecInterpExprStillValid (execExprInterp.c:2299) ==00:00:01:11.770 3464664== by 0x7DBBC4: ExecEvalExprNoReturn (executor.h:423) ==00:00:01:11.770 3464664== by 0x7DBC73: ExecEvalExprNoReturnSwitchContext (executor.h:464) ==00:00:01:11.770 3464664== ==00:00:01:11.770 3464664== VALGRINDERROR-END { Memcheck:User fun:pg_mblen_with_len fun:pg_mbcharcliplen_chars fun:text_substring fun:text_substr fun:ExecInterpExpr fun:ExecInterpExprStillValid fun:ExecEvalExprNoReturn fun:ExecEvalExprNoReturnSwitchContext fun:ExecProject fun:ExecScanExtended fun:ExecSeqScanWithProject fun:ExecProcNodeFirst } 2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 2026-02-13 21:03:38.905 PST client backend[3464664] pg_regress/encoding STATEMENT: SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; **00:00:01:11.771 3464664** Valgrind detected 1 error(s) during execution of "SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8;" --opQlKuIvXApRYIw/ Content-Type: text/plain; charset=utf-8 Content-Disposition: attachment; filename="toast-slice-mblen-v3.patch" Content-Transfer-Encoding: 8bit From: Noah Misch Fix SUBSTRING() for toasted multibyte characters. Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an incomplete character. Most callers want that. text_substring() does not. It detoasts the most bytes it could possibly need to get the requested number of characters. For example, to extract up to 2 chars from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8 chars, 8 bytes spans 2 complete chars and 1 partial char. Fix this by replacing this pg_mbstrlen_with_len() call with a string traversal that differs by stopping upon finding as many chars as the substring could need. This also makes SUBSTRING() stop raising an encoding error if the incomplete char is past the end of the substring. This is consistent with the general philosophy of the above commit, which was to raise errors on a just-in-time basis. Before the above commit, SUBSTRING() never raised an encoding error. SUBSTRING() has long been detoasting enough for one more char than needed, because it did not distinguish exclusive and inclusive end position. For avoidance of doubt, stop detoasting extra. Back-patch to v14, like the above commit. For applications using SUBSTRING() on non-ASCII column values, consider applying this to your copy of any of the February 12, 2026 releases. Reported-by: SATŌ Kentarō Reviewed-by: FIXME Bug: #19406 Discussion: https://postgr.es/m/19406-9867fddddd724fca@postgresql.org Backpatch-through: 14 diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index dbecd71..99a5470 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -133,6 +133,7 @@ static text *text_substring(Datum str, int32 start, int32 length, bool length_not_specified); +static int pg_mbcharcliplen_chars(const char *mbstr, int len, int limit); static text *text_overlay(text *t1, text *t2, int sp, int sl); static int text_position(text *t1, text *t2, Oid collid); static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state); @@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) int32 S = start; /* start position */ int32 S1; /* adjusted start position */ int32 L1; /* adjusted substring length */ - int32 E; /* end position */ + int32 E; /* end position, exclusive */ /* * SQL99 says S can be zero or negative (which we don't document), but we @@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) else { /* - * A zero or negative value for the end position can happen if the - * start was negative or one. SQL99 says to return a zero-length - * string. + * Ending at position 1, exclusive, obviously yields an empty + * string. A zero or negative value can happen if the start was + * negative or one. SQL99 says to return a zero-length string. */ - if (E < 1) + if (E <= 1) return cstring_to_text(""); /* @@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) L1 = E - S1; /* - * Total slice size in bytes can't be any longer than the start - * position plus substring length times the encoding max length. - * If that overflows, we can just use -1. + * Total slice size in bytes can't be any longer than the + * inclusive end position times the encoding max length. If that + * overflows, we can just use -1. */ - if (pg_mul_s32_overflow(E, eml, &slice_size)) + if (pg_mul_s32_overflow(E - 1, eml, &slice_size)) slice_size = -1; } @@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* Now we can get the actual length of the slice in MB characters */ - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice), - slice_len); + slice_strlen = + (slice_size == -1 ? + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) : + pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1)); /* * Check that the start position wasn't > slice_strlen. If so, SQL99 @@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified) } /* + * pg_mbcharcliplen_chars - + * Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes. + * + * This mirrors all the dubious historical behavior, so it's static to + * discourage proliferation. The assertions are specific to the one caller. + */ +static int +pg_mbcharcliplen_chars(const char *mbstr, int len, int limit) +{ + int nch = 0; + int l; + + Assert(len > 0); + Assert(limit > 0); + Assert(pg_database_encoding_max_length() > 1); + + while (len > 0 && *mbstr) + { + l = pg_mblen_with_len(mbstr, len); + nch++; + if (nch == limit) + break; + len -= l; + mbstr += l; + } + return nch; +} + +/* * textoverlay * Replace specified substring of first string with second * diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out index ea1f38c..cac1cb7 100644 --- a/src/test/regress/expected/encoding.out +++ b/src/test/regress/expected/encoding.out @@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; + substring +----------- + caf +(1 row) + +SELECT substring(truncated, 1, 4) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 SELECT reverse(truncated) FROM regress_encoding; ERROR: invalid byte sequence for encoding "UTF8": 0xc3 @@ -375,7 +381,43 @@ NOTICE: MULE_INTERNAL LC2: \x908283 -> {9470595} -> \x908283 = OK t (1 row) +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; + substring +----------- + +(1 row) + +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; + substring +----------- + … +(1 row) + +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + substring +----------- + 🚀 +(1 row) + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql index b9543c0..782f90f 100644 --- a/src/test/regress/sql/encoding.sql +++ b/src/test/regress/sql/encoding.sql @@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding; -- invalid short mb character = error SELECT length(truncated) FROM regress_encoding; -SELECT substring(truncated, 1, 1) FROM regress_encoding; +SELECT substring(truncated, 1, 3) FROM regress_encoding; +SELECT substring(truncated, 1, 4) FROM regress_encoding; SELECT reverse(truncated) FROM regress_encoding; -- invalid short mb character = silently dropped SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding; @@ -212,7 +213,23 @@ INSERT INTO encoding_tests VALUES SELECT COUNT(test_encoding(encoding, description, input)) > 0 FROM encoding_tests; +-- substring fetches a slice of a toasted value; unused tail of that slice is +-- an incomplete char (bug #19406) +CREATE TABLE toast_3b_utf8 (c text); +INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000)); +SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- diagnose incomplete char iff within the substring +UPDATE toast_3b_utf8 SET c = c || test_bytea_to_text('\xe280'); +SELECT SUBSTRING(c FROM 4000 FOR 1) FROM toast_3b_utf8; +SELECT SUBSTRING(c FROM 4001 FOR 1) FROM toast_3b_utf8; +-- substring needing last byte of its slice_size +ALTER TABLE toast_3b_utf8 RENAME TO toast_4b_utf8; +UPDATE toast_4b_utf8 SET c = repeat(U&'\+01F680', 3000); +SELECT SUBSTRING(c FROM 3000 FOR 1) FROM toast_4b_utf8; + DROP TABLE encoding_tests; +DROP TABLE toast_4b_utf8; DROP FUNCTION test_encoding; DROP FUNCTION test_text_to_wchars; DROP FUNCTION test_mblen_func; --opQlKuIvXApRYIw/ Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="mblen-valgrind-after-report-v1.patch" From: Noah Misch pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport. The prior order caused spurious Valgrind errors. They're spurious because the ereport(ERROR) stops subsequent code from accessing the memory in question. pg_mblen_cstr() ordered the checks correctly, but these other two did not. Back-patch to v14, like commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728. Reviewed-by: FIXME Discussion: https://postgr.es/m/FIXME Backpatch-through: 14 diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index a5a7348..f3f94d4 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -1086,15 +1086,16 @@ pg_mblen_range(const char *mbstr, const char *end) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(end > mbstr); + + if (unlikely(mbstr + length > end)) + report_invalid_encoding_db(mbstr, length, end - mbstr); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, end - mbstr); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(mbstr + length > end)) - report_invalid_encoding_db(mbstr, length, end - mbstr); - return length; } @@ -1109,15 +1110,16 @@ pg_mblen_with_len(const char *mbstr, int limit) int length = pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr); Assert(limit >= 1); + + if (unlikely(length > limit)) + report_invalid_encoding_db(mbstr, length, limit); + #ifdef VALGRIND_EXPENSIVE VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, limit); #else VALGRIND_CHECK_MEM_IS_DEFINED(mbstr, length); #endif - if (unlikely(length > limit)) - report_invalid_encoding_db(mbstr, length, limit); - return length; } --opQlKuIvXApRYIw/--