public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Latest patches break one of our unit-test, related to RLS
9+ messages / 5 participants
[nested] [flat]

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 13:34  Dominique Devienne <[email protected]>
  0 siblings, 2 replies; 9+ messages in thread

From: Dominique Devienne @ 2025-09-12 13:34 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: pgsql-general

On Fri, Sep 12, 2025 at 3:29 PM Dominique Devienne <[email protected]> wrote:
> On Fri, Sep 12, 2025 at 3:24 PM Dominique Devienne <[email protected]> wrote:
> > On Fri, Sep 12, 2025 at 3:11 PM Dominique Devienne <[email protected]> wrote:
>
> > So I don't see how my `... where v similar to 'foo[\d\w]_%'` is incorrect.
> > So again, is this a bug / regression or not? Thanks, --DD
>
> If I use (x|y) instead of [xy] it seems to behave correctly.
> Whether x is the full-length POSIX class, or the shorthand notation.
> This DOES look like a bug, no? I've done regexes for a long time,
> and these two forms should be equivalent IMHO. --DD
>
> postgres=# show server_version;
>  server_version
> ----------------
>  18rc1
> (1 row)
>
>
> postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> select v from t where v similar to 'foo[\d\w]_%';
>  v
> ---
> (0 rows)
>
>
> postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%';
>  v
> ---
> (0 rows)
>
>
> postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> select v from t where v similar to 'foo([[:digit:]]|[[:word:]])_%';
>     v
> ---------
>  foo0bar
> (1 row)
>
>
> postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> select v from t where v similar to 'foo(\d|\w)_%';
>     v
> ---------
>  foo0bar
> (1 row)

For completeness:

postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
select v from t where v similar to 'foo[0-9a-zA-Z]_%';
    v
---------
 foo0bar
(1 row)






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 13:53  jian he <[email protected]>
  parent: Dominique Devienne <[email protected]>
  1 sibling, 1 reply; 9+ messages in thread

From: jian he @ 2025-09-12 13:53 UTC (permalink / raw)
  To: Dominique Devienne <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general

On Fri, Sep 12, 2025 at 9:35 PM Dominique Devienne <[email protected]> wrote:
>
> On Fri, Sep 12, 2025 at 3:29 PM Dominique Devienne <[email protected]> wrote:
> > On Fri, Sep 12, 2025 at 3:24 PM Dominique Devienne <[email protected]> wrote:
> > > On Fri, Sep 12, 2025 at 3:11 PM Dominique Devienne <[email protected]> wrote:
> >
> > > So I don't see how my `... where v similar to 'foo[\d\w]_%'` is incorrect.
> > > So again, is this a bug / regression or not? Thanks, --DD
> >
> > If I use (x|y) instead of [xy] it seems to behave correctly.
> > Whether x is the full-length POSIX class, or the shorthand notation.
> > This DOES look like a bug, no? I've done regexes for a long time,
> > and these two forms should be equivalent IMHO. --DD
> >
> > postgres=# show server_version;
> >  server_version
> > ----------------
> >  18rc1
> > (1 row)
> >
> >
> > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> > select v from t where v similar to 'foo[\d\w]_%';
> >  v
> > ---
> > (0 rows)
> >
> >
> > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
> > select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%';
> >  v
> > ---
> > (0 rows)
> >

The above two examples are the same, per
(Table 9.21. Regular Expression Class-Shorthand Escapes)
https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP

my guess why 'foo0bar' will fail for 'foo[[[:digit:]][[:word:]]]_%';
1. process character 0, it does meet [[:digits]] character class.
2. process character b, it does not meet [[:digits]], then fails,
it won't check again whether character b is satisfied with [[:word:]] or not.






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 13:57  Dominique Devienne <[email protected]>
  parent: jian he <[email protected]>
  0 siblings, 0 replies; 9+ messages in thread

From: Dominique Devienne @ 2025-09-12 13:57 UTC (permalink / raw)
  To: jian he <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general

On Fri, Sep 12, 2025 at 3:54 PM jian he <[email protected]> wrote:
> > > select v from t where v similar to 'foo[\d\w]_%';
> > > select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%';

> The above two examples are the same, per
> (Table 9.21. Regular Expression Class-Shorthand Escapes)

Of course.

> my guess why 'foo0bar' will fail for 'foo[[[:digit:]][[:word:]]]_%';
> 1. process character 0, it does meet [[:digits]] character class.
> 2. process character b, it does not meet [[:digits]], then fails,
> it won't check again whether character b is satisfied with [[:word:]] or not.

Then you don't know what [...] means in regexes I'm afraid. --DD






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 14:07  Tom Lane <[email protected]>
  parent: Dominique Devienne <[email protected]>
  1 sibling, 2 replies; 9+ messages in thread

From: Tom Lane @ 2025-09-12 14:07 UTC (permalink / raw)
  To: Dominique Devienne <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general

Dominique Devienne <[email protected]> writes:
>> This DOES look like a bug, no? I've done regexes for a long time,
>> and these two forms should be equivalent IMHO. --DD

Yeah, I agree it's busted.  You can use EXPLAIN VERBOSE to see the
translated-to-POSIX pattern, and it's wrong:

regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))                                         
select v from t where v similar to 'foo[\d\w]_%';
                          QUERY PLAN                          
--------------------------------------------------------------
 Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=1 width=32)
   Output: "*VALUES*".column1
   Filter: ("*VALUES*".column1 ~ '^(?:foo[\d\w]_%)$'::text)
(3 rows)

The _ and % are not getting converted to their POSIX equivalents
("." and ".*").  Your other example still does that correctly:

regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))
select v from t where v similar to 'foo[0-9a-zA-Z]_%';
                            QUERY PLAN                            
------------------------------------------------------------------
 Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=1 width=32)
   Output: "*VALUES*".column1
   Filter: ("*VALUES*".column1 ~ '^(?:foo[0-9a-zA-Z]..*)$'::text)
(3 rows)

So e3ffc3e91 was at least one brick shy of a load.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 16:32  Álvaro Herrera <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 0 replies; 9+ messages in thread

From: Álvaro Herrera @ 2025-09-12 16:32 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Dominique Devienne <[email protected]>; Laurenz Albe <[email protected]>; pgsql-general; Michael Paquier <[email protected]>

On 2025-Sep-12, Tom Lane wrote:

> Dominique Devienne <[email protected]> writes:
> >> This DOES look like a bug, no? I've done regexes for a long time,
> >> and these two forms should be equivalent IMHO. --DD
> 
> Yeah, I agree it's busted.  You can use EXPLAIN VERBOSE to see the
> translated-to-POSIX pattern, and it's wrong:

And now we also have
https://postgr.es/m/[email protected]

CC'ed the committer here.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-12 23:34  Laurenz Albe <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 1 reply; 9+ messages in thread

From: Laurenz Albe @ 2025-09-12 23:34 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; Dominique Devienne <[email protected]>; +Cc: pgsql-general

On Fri, 2025-09-12 at 10:07 -0400, Tom Lane wrote:
> Dominique Devienne <[email protected]> writes:
> > > This DOES look like a bug, no? I've done regexes for a long time,
> > > and these two forms should be equivalent IMHO. --DD
> 
> Yeah, I agree it's busted.  You can use EXPLAIN VERBOSE to see the
> translated-to-POSIX pattern, and it's wrong:
> 
> regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar'))                                         
> select v from t where v similar to 'foo[\d\w]_%';
>                           QUERY PLAN                          
> --------------------------------------------------------------
>  Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=1 width=32)
>    Output: "*VALUES*".column1
>    Filter: ("*VALUES*".column1 ~ '^(?:foo[\d\w]_%)$'::text)
> (3 rows)
> 
> The _ and % are not getting converted to their POSIX equivalents
> ("." and ".*").

Indeed, and I have to take the blame for introducing a bug in a minor
release :^(

The attached patch should fix the problem.

Yours,
Laurenz Albe


Attachments:

  [text/x-patch] v1-0001-Amend-recent-fix-for-SIMILAR-TO-regex-conversion.patch (3.1K, 2-v1-0001-Amend-recent-fix-for-SIMILAR-TO-regex-conversion.patch)
  download | inline diff:
From 8d0f8ebac6c42fe7da36ec8c30ee091d20270068 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <[email protected]>
Date: Sat, 13 Sep 2025 01:32:55 +0200
Subject: [PATCH v1] Amend recent fix for SIMILAR TO regex conversion

Commit e3ffc3e91d fixed the translation of character classes in
SIMILAR TO regular expressions.  Unfortunately the fix broke a corner
case: if there is an escape character right after the opening bracket,
(for example in "[\q]") a closing bracket right after the escape
sequence would not be seen as closing the character class.

Author: Laurenz Albe <[email protected]>
Reported-By: Dominique Devienne <[email protected]>
Reported-By: Stephan Springl <[email protected]>
Backpatch-through: 13
---
 src/backend/utils/adt/regexp.c        | 6 ++++++
 src/test/regress/expected/strings.out | 9 +++++++++
 src/test/regress/sql/strings.sql      | 3 +++
 3 files changed, 18 insertions(+)

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 6e2864cbbda..29692fb1a9d 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -948,6 +948,12 @@ similar_escape_internal(text *pat_text, text *esc_text)
 				 */
 				*r++ = '\\';
 				*r++ = pchar;
+				/*
+				 * If we encounter an escaped character in a character class,
+				 * we must be past an initial ^ or ].
+				 */
+				if (charclass_depth > 0)
+					charclass_start = 3;
 			}
 			afterescape = false;
 		}
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index ba302da51e7..2d6cb02ad60 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -693,6 +693,15 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^';
    Filter: (f1 ~ '^(?:[^^]\^)$'::text)
 (2 rows)
 
+-- Closing square bracket after an escape sequence at the beginning of
+-- a character closes the character class
+EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|';
+              QUERY PLAN               
+---------------------------------------
+ Seq Scan on text_tbl
+   Filter: (f1 ~ '^(?:[\a].*)$'::text)
+(2 rows)
+
 -- Test backslash escapes in regexp_replace's replacement string
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
  regexp_replace 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index b94004cc08c..5ed421d6205 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -218,6 +218,9 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[]%][^]%][^%]%';
 -- Closing square bracket effective after two carets at the beginning
 -- of character class.
 EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^';
+-- Closing square bracket after an escape sequence at the beginning of
+-- a character closes the character class
+EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|';
 
 -- Test backslash escapes in regexp_replace's replacement string
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-13 00:12  Tom Lane <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 9+ messages in thread

From: Tom Lane @ 2025-09-13 00:12 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general

Laurenz Albe <[email protected]> writes:
> On Fri, 2025-09-12 at 10:07 -0400, Tom Lane wrote:
>> The _ and % are not getting converted to their POSIX equivalents
>> ("." and ".*").

> Indeed, and I have to take the blame for introducing a bug in a minor
> release :^(
> The attached patch should fix the problem.

I had not particularly studied the new charclass-parsing logic.
Looking at it now, this bit further down (lines 871ff) looks
fishy:

            if (pchar == ']' && charclass_start > 2)
                charclass_depth--;
            else if (pchar == '[')
                charclass_depth++;

            /*
             * If there is a caret right after the opening bracket, it negates
             * the character class, but a following closing bracket should
             * still be treated as a normal character.  That holds only for
             * the first caret, so only the values 1 and 2 mean that closing
             * brackets should be taken literally.
             */
            if (pchar == '^')
                charclass_start++;
            else
                charclass_start = 3;    /* definitely past the start */

Should not we be setting charclass_start to 1 after incrementing
charclass_depth?  That is, I'd be more comfortable if this logic
looked like

            if (pchar == ']' && charclass_start > 2)
                charclass_depth--;
            else if (pchar == '[')
            {
                /* start of a nested character class */
                charclass_depth++;
                charclass_start = 1;
            }
            else if (pchar == '^')
                charclass_start++;
            else
                charclass_start = 3;    /* definitely past the start */

I haven't experimented, but it looks like this might misprocess
^ or ] at the start of a nested character class.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-13 01:17  Laurenz Albe <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 9+ messages in thread

From: Laurenz Albe @ 2025-09-13 01:17 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general

On Fri, 2025-09-12 at 20:12 -0400, Tom Lane wrote:
> I had not particularly studied the new charclass-parsing logic.
> Looking at it now, this bit further down (lines 871ff) looks
> fishy:
> 
>             if (pchar == ']' && charclass_start > 2)
>                 charclass_depth--;
>             else if (pchar == '[')
>                 charclass_depth++;
> 
>             /*
>              * If there is a caret right after the opening bracket, it negates
>              * the character class, but a following closing bracket should
>              * still be treated as a normal character.  That holds only for
>              * the first caret, so only the values 1 and 2 mean that closing
>              * brackets should be taken literally.
>              */
>             if (pchar == '^')
>                 charclass_start++;
>             else
>                 charclass_start = 3;    /* definitely past the start */
> 
> Should not we be setting charclass_start to 1 after incrementing
> charclass_depth?

What I call "charclass depth" is misleading, I am afraid.
Really, it should be "bracket depth".  Only the outermost pair of brackets
starts an actual character class.  Example:

  []abc[:digit:]]

A caret or closing bracket right after the inner opening bracket wouldn't
be a special character, and I think it would never be legal.

Unfortunately, this is all pretty complicated.

Perhaps s/charclass_depth/bracket_depth/ would be a good idea.

Yours,
Laurenz Albe


^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Latest patches break one of our unit-test, related to RLS
@ 2025-09-13 01:53  Tom Lane <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 0 replies; 9+ messages in thread

From: Tom Lane @ 2025-09-13 01:53 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general

Laurenz Albe <[email protected]> writes:
> On Fri, 2025-09-12 at 20:12 -0400, Tom Lane wrote:
>> Should not we be setting charclass_start to 1 after incrementing
>> charclass_depth?

> What I call "charclass depth" is misleading, I am afraid.
> Really, it should be "bracket depth".  Only the outermost pair of brackets
> starts an actual character class.  Example:
>   []abc[:digit:]]
> A caret or closing bracket right after the inner opening bracket wouldn't
> be a special character, and I think it would never be legal.

Ah, got it.  But this logic definitely deserves more comments.
What do you think of something like

            if (pchar == ']' && charclass_start > 2)
            {
                /* found the real end of a bracket pair */
                charclass_depth--;
                /* past start of outer brackets, so keep charclass_start > 2 */
            }
            else if (pchar == '[')
            {
                /* start of a nested bracket pair */
                charclass_depth++;
                /* leading ^ or ] in this context is not special */
                charclass_start = 3;
            }
            else if (pchar == '^')
            {
                /* okay to increment charclass_start even if already > 1 */
                charclass_start++;
            }
            else
            {
                /* otherwise (including case of leading ']') */
                charclass_start = 3;    /* definitely past the start */
            }

> Perhaps s/charclass_depth/bracket_depth/ would be a good idea.

Wouldn't object to that.  Maybe we can think of a better name for
"charclass_start" too --- that sounds like a boolean condition.
The best I'm coming up with right now is "charclass_count", but
that's not that helpful.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 9+ messages in thread


end of thread, other threads:[~2025-09-13 01:53 UTC | newest]

Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-09-12 13:34 Re: Latest patches break one of our unit-test, related to RLS Dominique Devienne <[email protected]>
2025-09-12 13:53 ` jian he <[email protected]>
2025-09-12 13:57   ` Dominique Devienne <[email protected]>
2025-09-12 14:07 ` Tom Lane <[email protected]>
2025-09-12 16:32   ` Álvaro Herrera <[email protected]>
2025-09-12 23:34   ` Laurenz Albe <[email protected]>
2025-09-13 00:12     ` Tom Lane <[email protected]>
2025-09-13 01:17       ` Laurenz Albe <[email protected]>
2025-09-13 01:53         ` Tom Lane <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox