public inbox for [email protected]  
help / color / mirror / Atom feed
From: Noah Misch <[email protected]>
To: Thomas Munro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Date: Sat, 14 Feb 2026 11:33:44 -0800
Message-ID: <[email protected]> (raw)
In-Reply-To: <CA+hUKGKy7uqJv9HDMPjDR6O8bv9FSvdg2X5k8z3J1k6QhnJoCg@mail.gmail.com>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CA+hUKGJ2Azp21o+Vm3enjA6WNao_=Xyu_EFjZwoc=YqSg+LjXQ@mail.gmail.com>
	<CA+hUKGKy7uqJv9HDMPjDR6O8bv9FSvdg2X5k8z3J1k6QhnJoCg@mail.gmail.com>

On Sun, Feb 15, 2026 at 08:07:22AM +1300, Thomas Munro wrote:
> On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <[email protected]> wrote:
> > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <[email protected]> wrote:
> > > [mblen-valgrind-after-report-v1.patch]
> >
> > LGTM.  The new valgrind check should clearly be after the new non-local exit.
> >
> > Studying the other patch...

Thank you.

>                         /*
> -                        * Total slice size in bytes can't be any
> longer than the start
> -                        * position plus substring length times the
> encoding max length.
> -                        * If that overflows, we can just use -1.
> +                        * Total slice size in bytes can't be any
> longer than the
> +                        * inclusive end position times the encoding
> max length.  If that
> +                        * overflows, we can just use -1.
>                          */
> -                       if (pg_mul_s32_overflow(E, eml, &slice_size))
> +                       if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
>                                 slice_size = -1;
> 
> Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?

Since it's discrete (being an integer), "up to E, exclusive" and "up to E - 1,
inclusive" are the same thing.  My comment may not be the optimal way to
express that.

>                 /* Now we can get the actual length of the slice in MB
> characters */
> -               slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
> -
>                  slice_len);
> +               slice_strlen =
> +                       (slice_size == -1 ?
> +                        pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
> +                        pg_mbcharcliplen_chars(VARDATA_ANY(slice),
> slice_len, E - 1));
> 
> Comment presumably needs adjustment to say that we only count as far
> as we need to, and why.

Changed to:

		/*
		 * Now we can get the actual length of the slice in MB characters,
		 * stopping at the end of the substring.  Continuing beyond the
		 * substring end could find an incomplete character attributable
		 * solely to DatumGetTextPSlice() chopping in the middle of a
		 * character, and it would be superfluous work at best.
		 */

> There is something a bit strange about all this, though.
> pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
> that really exist past 2^29 (~500 million), you must get an empty
> string, right?   That's hard to reach, pre-existing and out of scope
> for the immediate problem report, except ... now we're contorting the
> code even further to keep it.

- slice_size is the amount we *requested* from the toaster.  It can be -1,
  which retrieves the max available.
- slice_len is the amount *returned* from the toaster.  It's nonnegative.

Does a behavior specific to strings >2^29 still exist?

> The outline I had come up with before seeing your patch was: let's
> just delete it.  The position search can check bounds incrementally,
> following our general approach.  This avoids the reported problem by
> ditching the pre-flight scan through the slice (up to 4x more
> pg_mblen_XXX calls and memory access than we strictly need), and also
> the special cases for empty strings since they already fall out of the
> general behaviour (am I missing something?), not leaving much code
> behind.

Like you, I made a note that it's wasteful to make two mblen passes over the
string.  I'm only seeing a 50% reduction in mblen calls, not an 80% reduction,
but I didn't look too closely.  I guessed such a change would be less clearly
correct, so I figured it would be less suitable for back branches.  Hence, I
didn't draft it.

> As far as I can see so far, the only user-visible side-effect
> requires corruption: substring() moves from the
> internal-NUL-is-terminator category to internal-NUL-is-character
> category, but that's an implementation detail.

That does carry some risk, not necessarily too much to accept.

> When I saw your patch yesterday, I initially abandoned the thought,
> thinking that your idea looked more conservative, but after sleeping
> on it and reflecting again on these oddities, I have merged my draft
> implementation with your tests, ancient detoasting fence post
> observation and commit message, just to see if you think this approach
> might be worth considering further.

My first impression, hurried due to the commit ETA in 30 minutes, is that this
is less conservative and should hold for master-only.






reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox