Date: Fri, 13 Feb 2026 14:48:04 -0800
From: Noah Misch
To: ranvis@gmail.com, pgsql-bugs@lists.postgresql.org
Cc: thomas.munro@gmail.com
Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Message-ID: <20260213224804.2c@rfd.leadboat.com>
References: <19406-9867fddddd724fca@postgresql.org> <20260213172702.71@rfd.leadboat.com>
In-Reply-To: <20260213172702.71@rfd.leadboat.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="t86TgJqs7ZWodUQg"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

--t86TgJqs7ZWodUQg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Feb 13, 2026 at 09:27:02AM -0800, Noah Misch wrote:
> On Fri, Feb 13, 2026 at 07:46:22AM +0000, PG Bug reporting form wrote:
> > After upgrading from PostgreSQL 15.15 to 15.16, substring(text) raises:
> > >ERROR: invalid byte sequence for encoding "UTF8": 0xe6 0x97
> > on valid UTF-8 text stored in a TOAST-compressed column.
>
> > user=> select substring(data from 1 for 1) from toast_repro;
> > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xe6 0x97
>
> Thanks for the report. That is a bug and a regression; I regret missing it
> during review. The substring operation works by taking a 4-byte slice from
> the toasted value (4 bytes being the max length of a UTF8 char in PostgreSQL),
> then finding the actual first character within those bytes. However, it
> incorrectly requires those four bytes to be a valid UTF8 string. I'll start
> on a fix.

Attached. I may add some more tests, e.g.
a toasted invalid string where the detoasted length is less than the slice we
request. This version is viable, however.

I audited the other pg_mbstrlen_with_len() callers, and I think they're all
okay with an error if the input has an incomplete char. Hence, those don't
need changes beyond what we've already released. Most pass either parser
input or an existing datum with its len. text_position_get_match_pos() is the
most subtle caller, and I think it's fine.

I audited other uses of slice detoast. The only other one is bytea substring,
which is obviously indifferent to character encoding.

--t86TgJqs7ZWodUQg
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename="toast-slice-mblen-v1.patch"
Content-Transfer-Encoding: 8bit

From: Noah Misch

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character. Most callers want that. text_substring() does
not. It detoasts the most bytes it could possibly need to get the
requested number of characters. For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8
chars, that yields 2 complete chars and 1 partial char. Fix this by
replacing this pg_mbstrlen_with_len() call with a string traversal that
differs by stopping upon finding as many chars as the substring could
need.

This also makes SUBSTRING() stop raising an encoding error if the
incomplete char is past the end of the substring. This is consistent
with the general philosophy of the above commit, which was to raise
errors on a just-in-time basis. Before the above commit, SUBSTRING()
never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position. Fix that incidentally. That and stopping the char count
early might improve performance.

Back-patch to v14 (all supported versions).

Reported-by: SATŌ Kentarō
Reviewed-by: FIXME
Bug: #19406
Discussion: https://postgr.es/m/19406-9867fddddd724fca@postgresql.org
Backpatch-through: 14

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index dbecd71..d0bc9e5 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -133,6 +133,7 @@ static text *text_substring(Datum str,
 							int32 start,
 							int32 length,
 							bool length_not_specified);
+static int	pg_mbcharcliplen_chars(const char *mbstr, int len, int limit);
 static text *text_overlay(text *t1, text *t2, int sp, int sl);
 static int	text_position(text *t1, text *t2, Oid collid);
 static void text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state);
@@ -586,7 +587,7 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 	int32		S = start;		/* start position */
 	int32		S1;				/* adjusted start position */
 	int32		L1;				/* adjusted substring length */
-	int32		E;				/* end position */
+	int32		E;				/* end position, exclusive */
 
 	/*
 	 * SQL99 says S can be zero or negative (which we don't document), but we
@@ -684,11 +685,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		else
 		{
 			/*
-			 * A zero or negative value for the end position can happen if the
-			 * start was negative or one. SQL99 says to return a zero-length
-			 * string.
+			 * Ending at position 1, exclusive, obviously yields an empty
+			 * string. A zero or negative value can happen if the start was
+			 * negative or one. SQL99 says to return a zero-length string.
 			 */
-			if (E < 1)
+			if (E <= 1)
 				return cstring_to_text("");
 
 			/*
@@ -698,11 +699,11 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 			L1 = E - S1;
 
 			/*
-			 * Total slice size in bytes can't be any longer than the start
-			 * position plus substring length times the encoding max length.
-			 * If that overflows, we can just use -1.
+			 * Total slice size in bytes can't be any longer than the
+			 * inclusive end position times the encoding max length. If that
+			 * overflows, we can just use -1.
 			 */
-			if (pg_mul_s32_overflow(E, eml, &slice_size))
+			if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
 				slice_size = -1;
 		}
 
@@ -726,8 +727,10 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 		}
 
 		/* Now we can get the actual length of the slice in MB characters */
-		slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-											slice_len);
+		slice_strlen =
+			(slice_size != -1 ?
+			 pg_mbcharcliplen_chars(VARDATA_ANY(slice), slice_len, E - 1) :
+			 pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len));
 
 		/*
 		 * Check that the start position wasn't > slice_strlen. If so, SQL99
@@ -783,6 +786,35 @@ text_substring(Datum str, int32 start, int32 length, bool length_not_specified)
 }
 
 /*
+ * pg_mbcharcliplen_chars -
+ *	Mirror pg_mbcharcliplen(), except return value unit is chars, not bytes.
+ *
+ * This mirrors all the dubious historical behavior, so it's static to
+ * discourage proliferation. The assertions are specific to the one caller.
+ */
+static int
+pg_mbcharcliplen_chars(const char *mbstr, int len, int limit)
+{
+	int			nch = 0;
+	int			l;
+
+	Assert(len > 0);
+	Assert(limit > 0);
+	Assert(pg_database_encoding_max_length() > 1);
+
+	while (len > 0 && *mbstr)
+	{
+		l = pg_mblen_with_len(mbstr, len);
+		nch++;
+		if (nch == limit)
+			break;
+		len -= l;
+		mbstr += l;
+	}
+	return nch;
+}
+
+/*
  * textoverlay
  *	Replace specified substring of first string with second
  *
diff --git a/src/test/regress/expected/encoding.out b/src/test/regress/expected/encoding.out
index ea1f38c..d850664 100644
--- a/src/test/regress/expected/encoding.out
+++ b/src/test/regress/expected/encoding.out
@@ -63,7 +63,13 @@ SELECT reverse(good) FROM regress_encoding;
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+ substring 
+-----------
+ caf
+(1 row)
+
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
 SELECT reverse(truncated) FROM regress_encoding;
 ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
@@ -388,6 +394,16 @@ SELECT SUBSTRING('a' SIMILAR U&'\00AC' ESCAPE U&'\00A7');
  
 (1 row)
 
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
+ substring 
+-----------
+ …
+(1 row)
+
 -- Levenshtein distance metric: exercise character length cache.
 SELECT U&"real\00A7_name" FROM (select 1) AS x(real_name);
 ERROR:  column "real§_name" does not exist
diff --git a/src/test/regress/sql/encoding.sql b/src/test/regress/sql/encoding.sql
index b9543c0..1b2178b 100644
--- a/src/test/regress/sql/encoding.sql
+++ b/src/test/regress/sql/encoding.sql
@@ -40,7 +40,8 @@ SELECT reverse(good) FROM regress_encoding;
 
 -- invalid short mb character = error
 SELECT length(truncated) FROM regress_encoding;
-SELECT substring(truncated, 1, 1) FROM regress_encoding;
+SELECT substring(truncated, 1, 3) FROM regress_encoding;
+SELECT substring(truncated, 1, 4) FROM regress_encoding;
 SELECT reverse(truncated) FROM regress_encoding;
 -- invalid short mb character = silently dropped
 SELECT regexp_replace(truncated, '^caf(.)$', '\1') FROM regress_encoding;
@@ -222,6 +223,11 @@ DROP FUNCTION test_text_to_bytea;
 
 -- substring slow path: multi-byte escape char vs. multi-byte pattern char.
 SELECT SUBSTRING('a' SIMILAR U&'\00AC' ESCAPE U&'\00A7');
+-- substring fetches a slice of a toasted value; unused tail of that slice is
+-- an incomplete char (bug #19406)
+CREATE TABLE toast_3b_utf8 (c text);
+INSERT INTO toast_3b_utf8 VALUES (repeat(U&'\2026', 4000));
+SELECT SUBSTRING(c FROM 1 FOR 1) FROM toast_3b_utf8;
 -- Levenshtein distance metric: exercise character length cache.
 SELECT U&"real\00A7_name" FROM (select 1) AS x(real_name);
 -- JSON errcontext: truncate long data.

--t86TgJqs7ZWodUQg--
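
For readers following the commit message, here is a minimal standalone sketch
of the difference between a strict character count and a count that stops at
a limit. It is not the patch's code: utf8_char_len(), count_chars_strict()
and count_chars_limited() are hypothetical names, the lead-byte lookup is a
simplified stand-in for pg_mblen_with_len(), and an incomplete char is
reported as -1 rather than through ereport(ERROR).

```c
#include <assert.h>

/*
 * Hypothetical helper: byte length of a UTF-8 char judged from its lead
 * byte alone.  Assumption: no validation of continuation bytes, unlike the
 * real encoding routines.
 */
static int
utf8_char_len(unsigned char lead)
{
	if (lead < 0x80)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
	return 1;					/* invalid lead byte: step over one byte */
}

/*
 * Strict count: walk the whole buffer and return -1 if it ends inside a
 * char.  This models the behavior that broke SUBSTRING() on a slice.
 */
static int
count_chars_strict(const char *s, int len)
{
	int			nch = 0;

	while (len > 0)
	{
		int			l = utf8_char_len((unsigned char) *s);

		if (l > len)
			return -1;			/* incomplete trailing char */
		nch++;
		len -= l;
		s += l;
	}
	return nch;
}

/*
 * Limited count: stop as soon as "limit" chars are found, so bytes past the
 * last needed char are never inspected.  Returns -1 only if an incomplete
 * char is reached before the limit.
 */
static int
count_chars_limited(const char *s, int len, int limit)
{
	int			nch = 0;

	assert(limit > 0);
	while (len > 0)
	{
		int			l = utf8_char_len((unsigned char) *s);

		if (l > len)
			return -1;			/* incomplete char inside the needed range */
		nch++;
		if (nch == limit)
			break;
		len -= l;
		s += l;
	}
	return nch;
}
```

With three copies of U+2026 (3 bytes each) cut to an 8-byte slice, as in the
commit message's example, count_chars_strict() reports -1 while
count_chars_limited() with limit 2 returns 2. Asking for 3 chars from the
same slice still reports -1, since the incomplete char is then inside the
requested range.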