Regex matching where text is input and regex stored in column

public inbox for [email protected]  
help / color / mirror / Atom feed

Regex matching where text is input and regex stored in column
2+ messages / 2 participants
[nested] [flat]

* Regex matching where text is input and regex stored in column
@ 2023-09-15 10:52  Achilleas Mantzios - cloud <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Achilleas Mantzios - cloud @ 2023-09-15 10:52 UTC (permalink / raw)
  To: pgsql-sql

Hello

I have a table like :

amantzio@sma/dynacom=# \d mail_vessel_addressbook
                                    Table "public.mail_vessel_addressbook"
Column  |  Type   | Collation | Nullable | 
                              Default
---------+---------+-----------+----------+-------------------------------------------------------------------- 

id      | integer |           | not null | 
nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
name    | text    |           | not null |
address | text    |           | not null |
vslid   | integer |           | not null |
Indexes:

.....

where address stores basically email patterns, human / edited text with 
possible wild cards, meant to be input mainly by our admins. There is an 
query that checks a particular new address (this runs via an exim4 mail 
server) which has become quite slow :

SELECT text('[email protected]') from mail_global_addressbook where 
text('[email protected]') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
CT text('[email protected]') from mail_vessel_addressbook where 
text('[email protected]') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;

It basically seq scans both tables, and the 2nd table in particular is 
fairly big : 123966 rows. So the above query tends to take 3.6 secs on 
average to execute. A pg_trgm index would not help since there is no way 
(that I know of) to index a regex in pgsql. Most common use is about 
indexing values to be checked against a given regex, quite the reverse 
from what we need here.

I got great improvement by materializing 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are 
running pgsql 10.23 :( , no generated columns possible, so I used a 
trigger, and then adding a normal b-tree index on the column. This 
causes Index only scan which is much better than the seq scan. Also I 
had to re-write the query as :

with

qry1 as

     (SELECT text('[email protected]') from mail_vessel_addressbook 
where

text('[email protected]') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

     LIMIT 1),
qry2 as

     (SELECT text('[email protected]') from mail_global_addressbook 
where

text('[email protected]') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

     LIMIT 1)
select text('[email protected]') from qry1 UNION select 
text('[email protected]') from qry2 LIMIT 1;

I was wondering if there is some more elegant and better solution, 
without the need for extra column and trigger.

^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: Regex matching where text is input and regex stored in column
@ 2023-09-16 16:35  Achilleas Mantzios <[email protected]>
  parent: Achilleas Mantzios - cloud <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Achilleas Mantzios @ 2023-09-16 16:35 UTC (permalink / raw)
  To: [email protected]

Στις 15/9/23 13:52, ο/η Achilleas Mantzios - cloud έγραψε:
>
> Hello
>
> I have a table like :
>
> amantzio@sma/dynacom=# \d mail_vessel_addressbook
>                                    Table "public.mail_vessel_addressbook"
> Column  |  Type   | Collation | Nullable | 
>                              Default
> ---------+---------+-----------+----------+-------------------------------------------------------------------- 
>
> id      | integer |           | not null | 
> nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
> name    | text    |           | not null |
> address | text    |           | not null |
> vslid   | integer |           | not null |
> Indexes:
>
> .....
>
> where address stores basically email patterns, human / edited text 
> with possible wild cards, meant to be input mainly by our admins. 
> There is an query that checks a particular new address (this runs via 
> an exim4 mail server) which has become quite slow :
>
> SELECT text('[email protected]') from mail_global_addressbook where 
> text('[email protected]') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
> CT text('[email protected]') from mail_vessel_addressbook where 
> text('[email protected]') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;
>
> It basically seq scans both tables, and the 2nd table in particular is 
> fairly big : 123966 rows. So the above query tends to take 3.6 secs on 
> average to execute. A pg_trgm index would not help since there is no 
> way (that I know of) to index a regex in pgsql. Most common use is 
> about indexing values to be checked against a given regex, quite the 
> reverse from what we need here.
>
> I got great improvement by materializing 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are 
> running pgsql 10.23 :( , no generated columns possible, so I used a 
> trigger, and then adding a normal b-tree index on the column. This 
> causes Index only scan which is much better than the seq scan. Also I 
> had to re-write the query as :
>
> with
>
> qry1 as
>
>     (SELECT text('[email protected]') from 
> mail_vessel_addressbook where
>
>     text('[email protected]') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
>
>     LIMIT 1),
> qry2 as
>
>     (SELECT text('[email protected]') from 
> mail_global_addressbook where
>
>      text('[email protected]') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
>
>     LIMIT 1)
> select text('[email protected]') from qry1 UNION select 
> text('[email protected]') from qry2 LIMIT 1;
>
> I was wondering if there is some more elegant and better solution, 
> without the need for extra column and trigger.
>

Answering this myself, the way I dealt with it was just create an extra 
normal index (btree) on the domain part, and include an extra equality 
check for the domain. This boosted performance from 2.5 secs to less 
than 4ms.


-- 
Achilleas Mantzios
  IT DEV - HEAD
  IT DEPT
  Dynacom Tankers Mgmt


^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2023-09-16 16:35 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2023-09-15 10:52 Regex matching where text is input and regex stored in column Achilleas Mantzios - cloud <[email protected]>
2023-09-16 16:35 ` Achilleas Mantzios <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox