Content-Type: multipart/alternative;
 boundary="------------nSSYvxyKAtXWehKB1nOQIuS3"
Message-ID: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com>
Date: Fri, 15 Sep 2023 13:52:52 +0300
MIME-Version: 1.0
To: pgsql-sql@postgresql.org
Content-Language: en-US
From: Achilleas Mantzios - cloud <a.mantzios@cloud.gatewaynet.com>
Subject: Regex matching where text is input and regex stored in column
Archived-At: 
 <https://www.postgresql.org/message-id/351abe02-57a5-a219-a789-57f3b6079f73%40cloud.gatewaynet.com>
Precedence: bulk

This is a multi-part message in MIME format.
--------------nSSYvxyKAtXWehKB1nOQIuS3
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hello

I have a table like :

amantzio@sma/dynacom=# \d mail_vessel_addressbook
                                    Table "public.mail_vessel_addressbook"
Column  |  Type   | Collation | Nullable | 
                              Default
---------+---------+-----------+----------+-------------------------------------------------------------------- 

id      | integer |           | not null | 
nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
name    | text    |           | not null |
address | text    |           | not null |
vslid   | integer |           | not null |
Indexes:

.....

where address stores basically email patterns, human / edited text with 
possible wild cards, meant to be input mainly by our admins. There is an 
query that checks a particular new address (this runs via an exim4 mail 
server) which has become quite slow :

SELECT text('news3@newsaccess.in') from mail_global_addressbook where 
text('news3@newsaccess.in') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
CT text('news3@newsaccess.in') from mail_vessel_addressbook where 
text('news3@newsaccess.in') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;

It basically seq scans both tables, and the 2nd table in particular is 
fairly big : 123966 rows. So the above query tends to take 3.6 secs on 
average to execute. A pg_trgm index would not help since there is no way 
(that I know of) to index a regex in pgsql. Most common use is about 
indexing values to be checked against a given regex, quite the reverse 
from what we need here.

I got great improvement by materializing 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are 
running pgsql 10.23 :( , no generated columns possible, so I used a 
trigger, and then adding a normal b-tree index on the column. This 
causes Index only scan which is much better than the seq scan. Also I 
had to re-write the query as :

with

qry1 as

     (SELECT text('watchkeepers@ukmto.org') from mail_vessel_addressbook 
where

text('watchkeepers@ukmto.org') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

     LIMIT 1),
qry2 as

     (SELECT text('watchkeepers@ukmto.org') from mail_global_addressbook 
where

text('watchkeepers@ukmto.org') ~* 
replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

     LIMIT 1)
select text('watchkeepers@ukmto.org') from qry1 UNION select 
text('watchkeepers@ukmto.org') from qry2 LIMIT 1;

I was wondering if there is some more elegant and better solution, 
without the need for extra column and trigger.


--------------nSSYvxyKAtXWehKB1nOQIuS3
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hello</p>
    <p>I have a table like : <br>
    </p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">amantzio@sma/dynacom=#
          \d mail_vessel_addressbook
        </span><br>
                                           Table
        "public.mail_vessel_addressbook"
        <br>
        Column  |  Type   | Collation | Nullable |
                                     Default
                                       <br>
---------+---------+-----------+----------+--------------------------------------------------------------------
        <br>
        id      | integer |           | not null |
        nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
        <br>
        name    | text    |           | not null |  <br>
        address | text    |           | not null |  <br>
        vslid   | integer |           | not null |  <br>
        Indexes:<br>
        <br>
        .....<br>
      </span></p>
    <p><span style="font-family:monospace">where address stores
        basically email patterns, human / edited text with possible wild
        cards, meant to be input mainly by our admins. There is an query
        that checks a particular new address (this runs via an exim4
        mail server) which has become quite slow :</span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">SELECT
          text('<a class="moz-txt-link-abbreviated" href="mailto:news3@newsaccess.in">news3@newsaccess.in</a>') from mail_global_addressbook where
          text('<a class="moz-txt-link-abbreviated" href="mailto:news3@newsaccess.in">news3@newsaccess.in</a>') ~*
          replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
           UNION SELE</span><br>
        CT text('<a class="moz-txt-link-abbreviated" href="mailto:news3@newsaccess.in">news3@newsaccess.in</a>') from mail_vessel_addressbook
        where text('<a class="moz-txt-link-abbreviated" href="mailto:news3@newsaccess.in">news3@newsaccess.in</a>') ~*
        replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
         LIMIT 1;<br>
      </span></p>
    <p><span style="font-family:monospace">It basically seq scans both
        tables, and the 2nd table in particular is fairly big : </span><span
        style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">123966 rows.
          So the above query tends to take 3.6 secs on average to
          execute. A pg_trgm index would not help since there is no way
          (that I know of) to index a regex in pgsql. Most common use is
          about indexing values to be checked against a given regex,
          quite the reverse from what we need here. <br>
        </span></span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">I got great
          improvement by materializing </span></span><span
        style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')</span></span>,
      <span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">we are running
          pgsql 10.23 :( , no generated columns possible, so I used a
          trigger, and then adding a normal b-tree index on the column.
          This causes Index only scan which is much better than the seq
          scan. Also I had to re-write the query as : <br>
        </span></span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">with <br>
        </span></span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">qry1 as <br>
        </span></span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">    (SELECT
          text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') from mail_vessel_addressbook
          where <br>
        </span></span></p>
    <p><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;">   
          text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') ~*
          replace(replace(replace(address,'*','.*'),'+','.+'),'?','</span>.?')</span></p>
    <p><span style="font-family:monospace">    LIMIT 1),
        <br>
        qry2 as <br>
      </span></p>
    <p><span style="font-family:monospace">    (SELECT
        text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') from mail_global_addressbook
        where</span></p>
    <p><span style="font-family:monospace">    
        text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') ~*
        replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  <br>
      </span></p>
    <p><span style="font-family:monospace">    LIMIT 1)
        <br>
        select text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') from qry1 UNION select
        text('<a class="moz-txt-link-abbreviated" href="mailto:watchkeepers@ukmto.org">watchkeepers@ukmto.org</a>') from qry2 LIMIT 1;</span></p>
    <p><span style="font-family:monospace">I was wondering if there is
        some more elegant and better solution, without the need for
        extra column and trigger.</span></p>
    <p><span style="font-family:monospace"><br>
      </span><span style="font-family:monospace"><span
          style="color:#000000;background-color:#ffffff;"></span></span></p>
  </body>
</html>

--------------nSSYvxyKAtXWehKB1nOQIuS3--