From: Tristan Partin <[email protected]>
To: Jeff Davis <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: dict_synonym.c: fix truncation of multibyte sequence
Date: Fri, 05 Jun 2026 20:46:00 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>

On Fri Jun 5, 2026 at 5:37 PM UTC, Jeff Davis wrote:
> On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:
>> > In any case, the input comes from a trusted
>> > source (dictionary configuration), so it's not very serious.
>> 
>> The fix looks and sounds good. Do we have any way to test this, so it
>> doesn't regress in the future?
>
>   -- Ⱥ is 2 bytes, 'ⱥ' is 3 bytes
>   $ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn
>
>   CREATE TEXT SEARCH DICTIONARY mb_syn (
>     TEMPLATE = synonym,
>     SYNONYMS = mbtest);
>
>   SELECT ts_lexize('mb_syn', 'foo');
>
>   =# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
>    ts_lexize 
>   -----------
>    {bar}
>   (1 row)
>
>   =# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
>    ts_lexize 
>   -----------
>    {barⱥ}
>   (1 row)
>
> It requires a specially-crafted synonym file, and I'm not sure it's
> worth much effort to add a test for this specific path. If we see
> similar bugs, it's more likely to be somewhere else that makes the same
> faulty assumption.
>
> If you do think we should add tests, we should probably add a set of
> dictionary-related files (.syn, .dict, .ths, etc.) that contain a
> variety of adversarial Unicode cases.
>
> I'd be inclined to just commit this fix for now. It needs backpatching,
> and I don't think we want to backpatch a large set of tests with it.

I would say proceed as you see fit. I am generally of the opinion that
additional testing is better, but I don't want to push for something if
others don't see the same value. I'd be happy to provide a patch for the
test in a subsequent discussion, if that is a good middle ground.
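
For anyone following along, here is a minimal Python sketch of why the
example above is sensitive to byte lengths. It is not the code in
dict_synonym.c. It only assumes, as the example suggests, that
lowercasing "barȺ" yields a longer UTF-8 string, and it shows what
happens if only the original byte count is kept. The helper name
utf8_len is made up for illustration.

```python
# Sketch only: shows that lowercasing can change UTF-8 byte length.
# This is an illustration, not the logic in dict_synonym.c.

def utf8_len(s: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

original = "bar\u023a"      # "barȺ": U+023A encodes to 2 bytes
lowered = original.lower()  # "barⱥ": U+2C65 encodes to 3 bytes

# Assumption for illustration: a copy that keeps only as many bytes as
# the original string had cuts the lowercased string mid-character.
truncated = lowered.encode("utf-8")[:utf8_len(original)]

print(utf8_len(original), utf8_len(lowered))
print(truncated)
```

A test along these lines would need input where the lowercased form is
longer in bytes than the original, which is what the .syn file above
provides.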

-- 
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)





