public inbox for [email protected]  
help / color / mirror / Atom feed
dict_synonym.c: fix truncation of multibyte sequence
5+ messages / 2 participants
[nested] [flat]

* dict_synonym.c: fix truncation of multibyte sequence
@ 2026-06-04 22:07 Jeff Davis <[email protected]>
  2026-06-05 15:57 ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Jeff Davis @ 2026-06-04 22:07 UTC (permalink / raw)
  To: pgsql-hackers


If case_sensitive is false and str_tolower() changes the byte length of
the string, then outlen will be incorrect.

Fortunately, pnstrdup() also stops at a NUL terminator, so it will
never overrun; but if outlen is calculated to be too small, then it
could cause truncation. In any case, the input comes from a trusted
source (dictionary configuration), so it's not very serious.

The correct value of outlen is strlen(d->syn[cur].out). But it's only
ever used in one place, which is a call to pnstrdup(). Given that the
string is NUL-terminated anyway, it's easier to fix it by just changing
that to a pstrdup(). Patch attached, backpatch all the way.

Regards,
	Jeff Davis



Attachments:

  [text/x-patch] v1-0001-dict_synonym.c-remove-incorrect-outlen.patch (1.5K, 2-v1-0001-dict_synonym.c-remove-incorrect-outlen.patch)
  download | inline diff:
From cc395fca68100d8bd354edab2ee93251285c943a Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 15 Apr 2026 12:56:31 -0700
Subject: [PATCH v1] dict_synonym.c: remove incorrect outlen.

Previously, outlen was miscalculated if case_sensitive was false and
str_tolower() changed the byte length of the string. If outlen was too
large, pnstrdup() would stop at the NUL terminator, preventing
overrun. But if outlen was too small, it would cause truncation,
possibly in the middle of a multibyte sequence.

Fix by just removing outlen. Both strings are NUL-terminated anyway.

Backpatch-through: 14
---
 src/backend/tsearch/dict_synonym.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/src/backend/tsearch/dict_synonym.c b/src/backend/tsearch/dict_synonym.c
index 3937f25bcc6..24345767658 100644
--- a/src/backend/tsearch/dict_synonym.c
+++ b/src/backend/tsearch/dict_synonym.c
@@ -24,7 +24,6 @@ typedef struct
 {
 	char	   *in;
 	char	   *out;
-	int			outlen;
 	uint16		flags;
 } Syn;
 
@@ -189,7 +188,6 @@ dsynonym_init(PG_FUNCTION_ARGS)
 			d->syn[cur].out = str_tolower(starto, strlen(starto), DEFAULT_COLLATION_OID);
 		}
 
-		d->syn[cur].outlen = strlen(starto);
 		d->syn[cur].flags = flags;
 
 		cur++;
@@ -237,7 +235,7 @@ dsynonym_lexize(PG_FUNCTION_ARGS)
 		PG_RETURN_POINTER(NULL);
 
 	res = palloc0_array(TSLexeme, 2);
-	res[0].lexeme = pnstrdup(found->out, found->outlen);
+	res[0].lexeme = pstrdup(found->out);
 	res[0].flags = found->flags;
 
 	PG_RETURN_POINTER(res);
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: dict_synonym.c: fix truncation of multibyte sequence
  2026-06-04 22:07 dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
@ 2026-06-05 15:57 ` Tristan Partin <[email protected]>
  2026-06-05 17:37   ` Re: dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Tristan Partin @ 2026-06-05 15:57 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: pgsql-hackers

On Thu Jun 4, 2026 at 10:07 PM UTC, Jeff Davis wrote:
>
> If case_sensitive is false and str_tolower() changes the byte length of
> the string, then outlen will be incorrect.
>
> Fortunately, pnstrdup() also stops at a NUL terminator, so it will
> never overrun; but if outlen is calculated to be too small, then it
> could cause truncation. In any case, the input comes from a trusted
> source (dictionary configuration), so it's not very serious.
>
> The correct value of outlen is strlen(d->syn[cur].out). But it's only
> ever used in one place, which is a call to pnstrdup(). Given that the
> string is NUL-terminated anyway, it's easier to fix it by just changing
> that to a pstrdup(). Patch attached, backpatch all the way.

The fix looks and sounds good. Do we have any way to test this, so it 
doesn't regress in the future? Do we need to export a module to test 
through SQL?

-- 
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: dict_synonym.c: fix truncation of multibyte sequence
  2026-06-04 22:07 dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  2026-06-05 15:57 ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
@ 2026-06-05 17:37   ` Jeff Davis <[email protected]>
  2026-06-05 20:46     ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Jeff Davis @ 2026-06-05 17:37 UTC (permalink / raw)
  To: Tristan Partin <[email protected]>; +Cc: pgsql-hackers

On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:
> > In any case, the input comes from a trusted
> > source (dictionary configuration), so it's not very serious.
> 
> The fix looks and sounds good. Do we have any way to test this, so it
> doesn't regress in the future?

  -- Ⱥ is 2 bytes, 'ⱥ' is 3 bytes
  $ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn

  CREATE TEXT SEARCH DICTIONARY mb_syn (
    TEMPLATE = synonym,
    SYNONYMS = mbtest);

  SELECT ts_lexize('mb_syn', 'foo');

  =# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
   ts_lexize 
  -----------
   {bar}
  (1 row)

  =# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
   ts_lexize 
  -----------
   {barⱥ}
  (1 row)

It requires a specially-crafted synonym file, and I'm not sure it's
worth much effort to add a test for this specific path. If we see
similar bugs, it's more likely to be somewhere else that makes the same
faulty assumption.

If you do think we should add tests, we should probably add a set of
dictionary-related files (.syn, .dict, .ths, etc.) that contain a
variety of adversarial Unicode cases.

I'd be inclined to just commit this fix for now. It needs backpatching,
and I don't think we want to backpatch a large set of tests with it.

Regards,
	Jeff Davis







^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: dict_synonym.c: fix truncation of multibyte sequence
  2026-06-04 22:07 dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  2026-06-05 15:57 ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
  2026-06-05 17:37   ` Re: dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
@ 2026-06-05 20:46     ` Tristan Partin <[email protected]>
  2026-06-05 21:34       ` Re: dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Tristan Partin @ 2026-06-05 20:46 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: pgsql-hackers

On Fri Jun 5, 2026 at 5:37 PM UTC, Jeff Davis wrote:
> On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:
>> > In any case, the input comes from a trusted
>> > source (dictionary configuration), so it's not very serious.
>> 
>> The fix looks and sounds good. Do we have any way to test this, so it
>> doesn't regress in the future?
>
>   -- Ⱥ is 2 bytes, 'ⱥ' is 3 bytes
>   $ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn
>
>   CREATE TEXT SEARCH DICTIONARY mb_syn (
>     TEMPLATE = synonym,
>     SYNONYMS = mbtest);
>
>   SELECT ts_lexize('mb_syn', 'foo');
>
>   =# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
>    ts_lexize 
>   -----------
>    {bar}
>   (1 row)
>
>   =# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
>    ts_lexize 
>   -----------
>    {barⱥ}
>   (1 row)
>
> It requires a specially-crafted synonym file, and I'm not sure it's
> worth much effort to add a test for this specific path. If we see
> similar bugs, it's more likely to be somewhere else that makes the same
> faulty assumption.
>
> If you do think we should add tests, we should probably add a set of
> dictionary-related files (.syn, .dict, .ths, etc.) that contain a
> variety of adversarial Unicode cases.
>
> I'd be inclined to just commit this fix for now. It needs backpatching,
> and I don't think we want to backpatch a large set of tests with it.

I would say proceed as you see fit. I guess I am generally of the 
opinion that additional testing is generally always better, but I don't 
want to push for something if others don't see the same value. I'd be 
happy to provide a patch for the test in a subsequent discussion if that 
is a good middle ground?

-- 
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: dict_synonym.c: fix truncation of multibyte sequence
  2026-06-04 22:07 dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  2026-06-05 15:57 ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
  2026-06-05 17:37   ` Re: dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
  2026-06-05 20:46     ` Re: dict_synonym.c: fix truncation of multibyte sequence Tristan Partin <[email protected]>
@ 2026-06-05 21:34       ` Jeff Davis <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Jeff Davis @ 2026-06-05 21:34 UTC (permalink / raw)
  To: Tristan Partin <[email protected]>; +Cc: pgsql-hackers

On Fri, 2026-06-05 at 20:46 +0000, Tristan Partin wrote:
> I would say proceed as you see fit. I guess I am generally of the 
> opinion that additional testing is generally always better, but I
> don't 
> want to push for something if others don't see the same value. I'd be
> happy to provide a patch for the test in a subsequent discussion if
> that 
> is a good middle ground?

Yes, sounds good.

Regards,
	Jeff Davis







^ permalink  raw  reply  [nested|flat] 5+ messages in thread


end of thread, other threads:[~2026-06-05 21:34 UTC | newest]

Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-06-04 22:07 dict_synonym.c: fix truncation of multibyte sequence Jeff Davis <[email protected]>
2026-06-05 15:57 ` Tristan Partin <[email protected]>
2026-06-05 17:37   ` Jeff Davis <[email protected]>
2026-06-05 20:46     ` Tristan Partin <[email protected]>
2026-06-05 21:34       ` Jeff Davis <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox