public inbox for [email protected]help / color / mirror / Atom feed
Re: Latest patches break one of our unit-test, related to RLS 9+ messages / 5 participants [nested] [flat]
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 13:34 Dominique Devienne <[email protected]> 0 siblings, 2 replies; 9+ messages in thread From: Dominique Devienne @ 2025-09-12 13:34 UTC (permalink / raw) To: Laurenz Albe <[email protected]>; +Cc: pgsql-general On Fri, Sep 12, 2025 at 3:29 PM Dominique Devienne <[email protected]> wrote: > On Fri, Sep 12, 2025 at 3:24 PM Dominique Devienne <[email protected]> wrote: > > On Fri, Sep 12, 2025 at 3:11 PM Dominique Devienne <[email protected]> wrote: > > > So I don't see how my `... where v similar to 'foo[\d\w]_%'` is incorrect. > > So again, is this a bug / regression or not? Thanks, --DD > > If I use (x|y) instead of [xy] it seems to behave correctly. > Whether x is the full-length POSIX class, or the shorthand notation. > This DOES look like a bug, no? I've done regexes for a long time, > and these two forms should be equivalent IMHO. --DD > > postgres=# show server_version; > server_version > ---------------- > 18rc1 > (1 row) > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > select v from t where v similar to 'foo[\d\w]_%'; > v > --- > (0 rows) > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%'; > v > --- > (0 rows) > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > select v from t where v similar to 'foo([[:digit:]]|[[:word:]])_%'; > v > --------- > foo0bar > (1 row) > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > select v from t where v similar to 'foo(\d|\w)_%'; > v > --------- > foo0bar > (1 row) For completeness: postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) select v from t where v similar to 'foo[0-9a-zA-Z]_%'; v --------- foo0bar (1 row) ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 13:53 jian he <[email protected]> parent: Dominique Devienne <[email protected]> 1 sibling, 1 reply; 9+ messages in thread From: jian he @ 2025-09-12 13:53 UTC (permalink / raw) To: Dominique Devienne <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general On Fri, Sep 12, 2025 at 9:35 PM Dominique Devienne <[email protected]> wrote: > > On Fri, Sep 12, 2025 at 3:29 PM Dominique Devienne <[email protected]> wrote: > > On Fri, Sep 12, 2025 at 3:24 PM Dominique Devienne <[email protected]> wrote: > > > On Fri, Sep 12, 2025 at 3:11 PM Dominique Devienne <[email protected]> wrote: > > > > > So I don't see how my `... where v similar to 'foo[\d\w]_%'` is incorrect. > > > So again, is this a bug / regression or not? Thanks, --DD > > > > If I use (x|y) instead of [xy] it seems to behave correctly. > > Whether x is the full-length POSIX class, or the shorthand notation. > > This DOES look like a bug, no? I've done regexes for a long time, > > and these two forms should be equivalent IMHO. --DD > > > > postgres=# show server_version; > > server_version > > ---------------- > > 18rc1 > > (1 row) > > > > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > > select v from t where v similar to 'foo[\d\w]_%'; > > v > > --- > > (0 rows) > > > > > > postgres=# with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > > select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%'; > > v > > --- > > (0 rows) > > The above two examples are the same, per (Table 9.21. Regular Expression Class-Shorthand Escapes) https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP my guess why 'foo0bar' will fail for 'foo[[[:digit:]][[:word:]]]_%'; 1. process character 0, it does meet [[:digits]] character class. 2. process character b, it does not meet [[:digits]], then fails, it won't check again whether character b is satisfied with [[:word:]] or not. ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 13:57 Dominique Devienne <[email protected]> parent: jian he <[email protected]> 0 siblings, 0 replies; 9+ messages in thread From: Dominique Devienne @ 2025-09-12 13:57 UTC (permalink / raw) To: jian he <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general On Fri, Sep 12, 2025 at 3:54 PM jian he <[email protected]> wrote: > > > select v from t where v similar to 'foo[\d\w]_%'; > > > select v from t where v similar to 'foo[[[:digit:]][[:word:]]]_%'; > The above two examples are the same, per > (Table 9.21. Regular Expression Class-Shorthand Escapes) Of course. > my guess why 'foo0bar' will fail for 'foo[[[:digit:]][[:word:]]]_%'; > 1. process character 0, it does meet [[:digits]] character class. > 2. process character b, it does not meet [[:digits]], then fails, > it won't check again whether character b is satisfied with [[:word:]] or not. Then you don't know what [...] means in regexes I'm afraid. --DD ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 14:07 Tom Lane <[email protected]> parent: Dominique Devienne <[email protected]> 1 sibling, 2 replies; 9+ messages in thread From: Tom Lane @ 2025-09-12 14:07 UTC (permalink / raw) To: Dominique Devienne <[email protected]>; +Cc: Laurenz Albe <[email protected]>; pgsql-general Dominique Devienne <[email protected]> writes: >> This DOES look like a bug, no? I've done regexes for a long time, >> and these two forms should be equivalent IMHO. --DD Yeah, I agree it's busted. You can use EXPLAIN VERBOSE to see the translated-to-POSIX pattern, and it's wrong: regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) select v from t where v similar to 'foo[\d\w]_%'; QUERY PLAN -------------------------------------------------------------- Values Scan on "*VALUES*" (cost=0.00..0.05 rows=1 width=32) Output: "*VALUES*".column1 Filter: ("*VALUES*".column1 ~ '^(?:foo[\d\w]_%)$'::text) (3 rows) The _ and % are not getting converted to their POSIX equivalents ("." and ".*"). Your other example still does that correctly: regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) select v from t where v similar to 'foo[0-9a-zA-Z]_%'; QUERY PLAN ------------------------------------------------------------------ Values Scan on "*VALUES*" (cost=0.00..0.05 rows=1 width=32) Output: "*VALUES*".column1 Filter: ("*VALUES*".column1 ~ '^(?:foo[0-9a-zA-Z]..*)$'::text) (3 rows) So e3ffc3e91 was at least one brick shy of a load. regards, tom lane ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 16:32 Álvaro Herrera <[email protected]> parent: Tom Lane <[email protected]> 1 sibling, 0 replies; 9+ messages in thread From: Álvaro Herrera @ 2025-09-12 16:32 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: Dominique Devienne <[email protected]>; Laurenz Albe <[email protected]>; pgsql-general; Michael Paquier <[email protected]> On 2025-Sep-12, Tom Lane wrote: > Dominique Devienne <[email protected]> writes: > >> This DOES look like a bug, no? I've done regexes for a long time, > >> and these two forms should be equivalent IMHO. --DD > > Yeah, I agree it's busted. You can use EXPLAIN VERBOSE to see the > translated-to-POSIX pattern, and it's wrong: And now we also have https://postgr.es/m/[email protected] CC'ed the committer here. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo" ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-12 23:34 Laurenz Albe <[email protected]> parent: Tom Lane <[email protected]> 1 sibling, 1 reply; 9+ messages in thread From: Laurenz Albe @ 2025-09-12 23:34 UTC (permalink / raw) To: Tom Lane <[email protected]>; Dominique Devienne <[email protected]>; +Cc: pgsql-general On Fri, 2025-09-12 at 10:07 -0400, Tom Lane wrote: > Dominique Devienne <[email protected]> writes: > > > This DOES look like a bug, no? I've done regexes for a long time, > > > and these two forms should be equivalent IMHO. --DD > > Yeah, I agree it's busted. You can use EXPLAIN VERBOSE to see the > translated-to-POSIX pattern, and it's wrong: > > regression=# explain verbose with t(v) as (values ('foo:bar'), ('foo/bar'), ('foo0bar')) > select v from t where v similar to 'foo[\d\w]_%'; > QUERY PLAN > -------------------------------------------------------------- > Values Scan on "*VALUES*" (cost=0.00..0.05 rows=1 width=32) > Output: "*VALUES*".column1 > Filter: ("*VALUES*".column1 ~ '^(?:foo[\d\w]_%)$'::text) > (3 rows) > > The _ and % are not getting converted to their POSIX equivalents > ("." and ".*"). Indeed, and I have to take the blame for introducing a bug in a minor release :^( The attached patch should fix the problem. Yours, Laurenz Albe Attachments: [text/x-patch] v1-0001-Amend-recent-fix-for-SIMILAR-TO-regex-conversion.patch (3.1K, 2-v1-0001-Amend-recent-fix-for-SIMILAR-TO-regex-conversion.patch) download | inline diff: From 8d0f8ebac6c42fe7da36ec8c30ee091d20270068 Mon Sep 17 00:00:00 2001 From: Laurenz Albe <[email protected]> Date: Sat, 13 Sep 2025 01:32:55 +0200 Subject: [PATCH v1] Amend recent fix for SIMILAR TO regex conversion Commit e3ffc3e91d fixed the translation of character classes in SIMILAR TO regular expressions. Unfortunately the fix broke a corner case: if there is an escape character right after the opening bracket, (for example in "[\q]") a closing bracket right after the escape sequence would not be seen as closing the character class. Author: Laurenz Albe <[email protected]> Reported-By: Dominique Devienne <[email protected]> Reported-By: Stephan Springl <[email protected]> Backpatch-through: 13 --- src/backend/utils/adt/regexp.c | 6 ++++++ src/test/regress/expected/strings.out | 9 +++++++++ src/test/regress/sql/strings.sql | 3 +++ 3 files changed, 18 insertions(+) diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 6e2864cbbda..29692fb1a9d 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -948,6 +948,12 @@ similar_escape_internal(text *pat_text, text *esc_text) */ *r++ = '\\'; *r++ = pchar; + /* + * If we encounter an escaped character in a character class, + * we must be past an initial ^ or ]. + */ + if (charclass_depth > 0) + charclass_start = 3; } afterescape = false; } diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out index ba302da51e7..2d6cb02ad60 100644 --- a/src/test/regress/expected/strings.out +++ b/src/test/regress/expected/strings.out @@ -693,6 +693,15 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^'; Filter: (f1 ~ '^(?:[^^]\^)$'::text) (2 rows) +-- Closing square bracket after an escape sequence at the beginning of +-- a character closes the character class +EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|'; + QUERY PLAN +--------------------------------------- + Seq Scan on text_tbl + Filter: (f1 ~ '^(?:[\a].*)$'::text) +(2 rows) + -- Test backslash escapes in regexp_replace's replacement string SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); regexp_replace diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql index b94004cc08c..5ed421d6205 100644 --- a/src/test/regress/sql/strings.sql +++ b/src/test/regress/sql/strings.sql @@ -218,6 +218,9 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[]%][^]%][^%]%'; -- Closing square bracket effective after two carets at the beginning -- of character class. EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^'; +-- Closing square bracket after an escape sequence at the beginning of +-- a character closes the character class +EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|'; -- Test backslash escapes in regexp_replace's replacement string SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); -- 2.51.0 ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-13 00:12 Tom Lane <[email protected]> parent: Laurenz Albe <[email protected]> 0 siblings, 1 reply; 9+ messages in thread From: Tom Lane @ 2025-09-13 00:12 UTC (permalink / raw) To: Laurenz Albe <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general Laurenz Albe <[email protected]> writes: > On Fri, 2025-09-12 at 10:07 -0400, Tom Lane wrote: >> The _ and % are not getting converted to their POSIX equivalents >> ("." and ".*"). > Indeed, and I have to take the blame for introducing a bug in a minor > release :^( > The attached patch should fix the problem. I had not particularly studied the new charclass-parsing logic. Looking at it now, this bit further down (lines 871ff) looks fishy: if (pchar == ']' && charclass_start > 2) charclass_depth--; else if (pchar == '[') charclass_depth++; /* * If there is a caret right after the opening bracket, it negates * the character class, but a following closing bracket should * still be treated as a normal character. That holds only for * the first caret, so only the values 1 and 2 mean that closing * brackets should be taken literally. */ if (pchar == '^') charclass_start++; else charclass_start = 3; /* definitely past the start */ Should not we be setting charclass_start to 1 after incrementing charclass_depth? That is, I'd be more comfortable if this logic looked like if (pchar == ']' && charclass_start > 2) charclass_depth--; else if (pchar == '[') { /* start of a nested character class */ charclass_depth++; charclass_start = 1; } else if (pchar == '^') charclass_start++; else charclass_start = 3; /* definitely past the start */ I haven't experimented, but it looks like this might misprocess ^ or ] at the start of a nested character class. regards, tom lane ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-13 01:17 Laurenz Albe <[email protected]> parent: Tom Lane <[email protected]> 0 siblings, 1 reply; 9+ messages in thread From: Laurenz Albe @ 2025-09-13 01:17 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general On Fri, 2025-09-12 at 20:12 -0400, Tom Lane wrote: > I had not particularly studied the new charclass-parsing logic. > Looking at it now, this bit further down (lines 871ff) looks > fishy: > > if (pchar == ']' && charclass_start > 2) > charclass_depth--; > else if (pchar == '[') > charclass_depth++; > > /* > * If there is a caret right after the opening bracket, it negates > * the character class, but a following closing bracket should > * still be treated as a normal character. That holds only for > * the first caret, so only the values 1 and 2 mean that closing > * brackets should be taken literally. > */ > if (pchar == '^') > charclass_start++; > else > charclass_start = 3; /* definitely past the start */ > > Should not we be setting charclass_start to 1 after incrementing > charclass_depth? What I call "charclass depth" is misleading, I am afraid. Really, it should be "bracket depth". Only the outermost pair of brackets starts an actual character class. Example: []abc[:digit:]] A caret or closing bracket right after the inner opening bracket wouldn't be a special character, and I think it would never be legal. Unfortunately, this is all pretty complicated. Perhaps s/charclass_depth/bracket_depth/ would be a good idea. Yours, Laurenz Albe ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Latest patches break one of our unit-test, related to RLS @ 2025-09-13 01:53 Tom Lane <[email protected]> parent: Laurenz Albe <[email protected]> 0 siblings, 0 replies; 9+ messages in thread From: Tom Lane @ 2025-09-13 01:53 UTC (permalink / raw) To: Laurenz Albe <[email protected]>; +Cc: Dominique Devienne <[email protected]>; pgsql-general Laurenz Albe <[email protected]> writes: > On Fri, 2025-09-12 at 20:12 -0400, Tom Lane wrote: >> Should not we be setting charclass_start to 1 after incrementing >> charclass_depth? > What I call "charclass depth" is misleading, I am afraid. > Really, it should be "bracket depth". Only the outermost pair of brackets > starts an actual character class. Example: > []abc[:digit:]] > A caret or closing bracket right after the inner opening bracket wouldn't > be a special character, and I think it would never be legal. Ah, got it. But this logic definitely deserves more comments. What do you think of something like if (pchar == ']' && charclass_start > 2) { /* found the real end of a bracket pair */ charclass_depth--; /* past start of outer brackets, so keep charclass_start > 2 */ } else if (pchar == '[') { /* start of a nested bracket pair */ charclass_depth++; /* leading ^ or ] in this context is not special */ charclass_start = 3; } else if (pchar == '^') { /* okay to increment charclass_start even if already > 1 */ charclass_start++; } else { /* otherwise (including case of leading ']') */ charclass_start = 3; /* definitely past the start */ } > Perhaps s/charclass_depth/bracket_depth/ would be a good idea. Wouldn't object to that. Maybe we can think of a better name for "charclass_start" too --- that sounds like a boolean condition. The best I'm coming up with right now is "charclass_count", but that's not that helpful. regards, tom lane ^ permalink raw reply [nested|flat] 9+ messages in thread
end of thread, other threads:[~2025-09-13 01:53 UTC | newest] Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2025-09-12 13:34 Re: Latest patches break one of our unit-test, related to RLS Dominique Devienne <[email protected]> 2025-09-12 13:53 ` jian he <[email protected]> 2025-09-12 13:57 ` Dominique Devienne <[email protected]> 2025-09-12 14:07 ` Tom Lane <[email protected]> 2025-09-12 16:32 ` Álvaro Herrera <[email protected]> 2025-09-12 23:34 ` Laurenz Albe <[email protected]> 2025-09-13 00:12 ` Tom Lane <[email protected]> 2025-09-13 01:17 ` Laurenz Albe <[email protected]> 2025-09-13 01:53 ` Tom Lane <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox