public inbox for [email protected]
pgsql: Fix SUBSTRING() for toasted multibyte characters.
From: Noah Misch <[email protected]> @ 2026-02-14 20:17 UTC
To: [email protected]

Fix SUBSTRING() for toasted multibyte characters.

Commit 1e7fe06c10c0a8da9dd6261a6be8d405dc17c728 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, 8 bytes spans 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  For avoidance of doubt, stop detoasting extra.

Back-patch to v14, like the above commit.  For applications using
SUBSTRING() on non-ASCII column values, consider applying this to your
copy of any of the February 12, 2026 releases.

Reported-by: SATŌ Kentarō <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Bug: #19406
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

Branch
------
REL_17_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/5d5232bc38d36a30b569396f15e5a22cd6ee529b

Modified Files
--------------
src/backend/utils/adt/varlena.c        | 62 +++++++++++++++++++++++++++-------
src/test/regress/expected/encoding.out | 44 +++++++++++++++++++++++-
src/test/regress/sql/encoding.sql      | 19 ++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)

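The byte arithmetic and the early-stopping traversal described in the commit message can be illustrated with a small Python sketch. This is not the code in varlena.c: every name below (MAX_UTF8_CHAR_BYTES, utf8_char_len, count_chars_up_to, substring_prefix) is hypothetical, a byte slice stands in for detoasting a prefix, and raising an error when a truncated character falls inside the requested range is an assumption drawn from the "just-in-time" wording above.

```python
# Illustrative sketch only; names are hypothetical, not PostgreSQL's.
MAX_UTF8_CHAR_BYTES = 4  # widest UTF-8 character the sketch budgets for


def utf8_char_len(first_byte):
    """Byte length of a UTF-8 character, judged from its lead byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def count_chars_up_to(prefix, limit):
    """Walk `prefix`, stopping once `limit` complete characters are found.

    Returns (chars_found, bytes_consumed). A truncated character is
    reported only if the walk reaches it before finding `limit`
    characters (assumed behavior for this sketch).
    """
    pos = 0
    chars = 0
    while chars < limit and pos < len(prefix):
        n = utf8_char_len(prefix[pos])
        if pos + n > len(prefix):
            raise ValueError("incomplete character inside requested range")
        pos += n
        chars += 1
    return chars, pos


def substring_prefix(stored, want_chars):
    """Return the first `want_chars` characters of UTF-8 bytes `stored`."""
    # Worst case: every requested character is 4 bytes wide.
    slice_len = want_chars * MAX_UTF8_CHAR_BYTES
    prefix = stored[:slice_len]  # stands in for detoasting a slice
    _, used = count_chars_up_to(prefix, want_chars)
    return prefix[:used].decode("utf-8")


if __name__ == "__main__":
    stored = "あいうえお".encode("utf-8")  # five 3-byte characters
    prefix = stored[:2 * MAX_UTF8_CHAR_BYTES]
    # 8 bytes: two complete characters plus 2 bytes of a third.
    print(len(prefix), count_chars_up_to(prefix, 2))
    print(substring_prefix(stored, 2))
```

Validating the whole 8-byte slice fails on the trailing partial character, which mirrors the reported bug; counting only as far as the two characters requested never touches those last 2 bytes.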
Branch
------
REL_16_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/bdfb372280bc507239a02c0a060f0f51ebd41fa1

Modified Files
--------------
src/backend/utils/adt/varlena.c        | 62 +++++++++++++++++++++++++++-------
src/test/regress/expected/encoding.out | 44 +++++++++++++++++++++++-
src/test/regress/sql/encoding.sql      | 19 ++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)

Branch
------
REL_15_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/a20eb248c51ad6741bbbb3b3bac103d5788bd4f5

Modified Files
--------------
src/backend/utils/adt/varlena.c        | 62 +++++++++++++++++++++++++++-------
src/test/regress/expected/encoding.out | 44 +++++++++++++++++++++++-
src/test/regress/sql/encoding.sql      | 19 ++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)

Branch
------
REL_14_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/14b1fd6176cb9353846deff607e5dfad09eb5b23

Modified Files
--------------
src/backend/utils/adt/varlena.c         | 62 ++++++++++++++++++++++++++-------
src/test/regress/input/encoding.source  | 19 +++++++++-
src/test/regress/output/encoding.source | 44 ++++++++++++++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/9f4fd119b2cbb9a41ec0c19a8d6ec9b59b92c125

Modified Files
--------------
src/backend/utils/adt/varlena.c        | 62 +++++++++++++++++++++++++++-------
src/test/regress/expected/encoding.out | 44 +++++++++++++++++++++++-
src/test/regress/sql/encoding.sql      | 19 ++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)

Branch
------
REL_18_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/6e045e1a6e3f1a55d3d246f8258f3316410b26f6

Modified Files
--------------
src/backend/utils/adt/varlena.c        | 62 +++++++++++++++++++++++++++-------
src/test/regress/expected/encoding.out | 44 +++++++++++++++++++++++-
src/test/regress/sql/encoding.sql      | 19 ++++++++++-
3 files changed, 111 insertions(+), 14 deletions(-)