Content-Type: multipart/alternative;
 boundary="------------LVCF8plaxL1hdnZdhMF6F2qq"
Message-ID: <a09543ab-12b8-0fef-f7e6-9cd4082df828@cloud.gatewaynet.com>
Date: Sat, 16 Sep 2023 19:35:05 +0300
MIME-Version: 1.0
Subject: Re: Regex matching where text is input and regex stored in column
Content-Language: en-US
To: pgsql-sql@lists.postgresql.org
References: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com>
From: Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com>
In-Reply-To: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com>
Archived-At: 
 <https://www.postgresql.org/message-id/a09543ab-12b8-0fef-f7e6-9cd4082df828%40cloud.gatewaynet.com>
Precedence: bulk

This is a multi-part message in MIME format.
--------------LVCF8plaxL1hdnZdhMF6F2qq
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Στις 15/9/23 13:52, ο/η Achilleas Mantzios - cloud έγραψε:
>
> Hello
>
> I have a table like :
>
> amantzio@sma/dynacom=# \d mail_vessel_addressbook
>                                    Table "public.mail_vessel_addressbook"
> Column  |  Type   | Collation | Nullable | 
>                              Default
> ---------+---------+-----------+----------+-------------------------------------------------------------------- 
>
> id      | integer |           | not null | 
> nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
> name    | text    |           | not null |
> address | text    |           | not null |
> vslid   | integer |           | not null |
> Indexes:
>
> .....
>
> where address stores basically email patterns, human / edited text 
> with possible wild cards, meant to be input mainly by our admins. 
> There is an query that checks a particular new address (this runs via 
> an exim4 mail server) which has become quite slow :
>
> SELECT text('news3@newsaccess.in') from mail_global_addressbook where 
> text('news3@newsaccess.in') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
> CT text('news3@newsaccess.in') from mail_vessel_addressbook where 
> text('news3@newsaccess.in') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;
>
> It basically seq scans both tables, and the 2nd table in particular is 
> fairly big : 123966 rows. So the above query tends to take 3.6 secs on 
> average to execute. A pg_trgm index would not help since there is no 
> way (that I know of) to index a regex in pgsql. Most common use is 
> about indexing values to be checked against a given regex, quite the 
> reverse from what we need here.
>
> I got great improvement by materializing 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are 
> running pgsql 10.23 :( , no generated columns possible, so I used a 
> trigger, and then adding a normal b-tree index on the column. This 
> causes Index only scan which is much better than the seq scan. Also I 
> had to re-write the query as :
>
> with
>
> qry1 as
>
>     (SELECT text('watchkeepers@ukmto.org') from 
> mail_vessel_addressbook where
>
>     text('watchkeepers@ukmto.org') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
>
>     LIMIT 1),
> qry2 as
>
>     (SELECT text('watchkeepers@ukmto.org') from 
> mail_global_addressbook where
>
>      text('watchkeepers@ukmto.org') ~* 
> replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
>
>     LIMIT 1)
> select text('watchkeepers@ukmto.org') from qry1 UNION select 
> text('watchkeepers@ukmto.org') from qry2 LIMIT 1;
>
> I was wondering if there is some more elegant and better solution, 
> without the need for extra column and trigger.
>

Answering this myself, the way I dealt with it was just create an extra 
normal index (btree) on the domain part, and include an extra equality 
check for the domain. This boosted performance from 2.5 secs to less 
than 4ms.


-- 
Achilleas Mantzios
  IT DEV - HEAD
  IT DEPT
  Dynacom Tankers Mgmt

--------------LVCF8plaxL1hdnZdhMF6F2qq
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <div class="moz-cite-prefix">Στις 15/9/23 13:52, ο/η Achilleas
      Mantzios - cloud έγραψε:<br>
    </div>
    <blockquote type="cite"
      cite="mid:351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <p>Hello</p>
      <p>I have a table like : <br>
      </p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">amantzio@sma/dynacom=#
            \d mail_vessel_addressbook </span><br>
                                             Table
          "public.mail_vessel_addressbook" <br>
          Column  |  Type   | Collation | Nullable |
                                       Default
                                         <br>
---------+---------+-----------+----------+--------------------------------------------------------------------
          <br>
          id      | integer |           | not null |
          nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
          <br>
          name    | text    |           | not null |  <br>
          address | text    |           | not null |  <br>
          vslid   | integer |           | not null |  <br>
          Indexes:<br>
          <br>
          .....<br>
        </span></p>
      <p><span style="font-family:monospace">where address stores
          basically email patterns, human / edited text with possible
          wild cards, meant to be input mainly by our admins. There is
          an query that checks a particular new address (this runs via
          an exim4 mail server) which has become quite slow :</span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">SELECT
            text('<a class="moz-txt-link-abbreviated
              moz-txt-link-freetext" href="mailto:news3@newsaccess.in"
              moz-do-not-send="true">news3@newsaccess.in</a>') from
            mail_global_addressbook where text('<a
              class="moz-txt-link-abbreviated moz-txt-link-freetext"
              href="mailto:news3@newsaccess.in" moz-do-not-send="true">news3@newsaccess.in</a>')
            ~*
            replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
             UNION SELE</span><br>
          CT text('<a class="moz-txt-link-abbreviated
            moz-txt-link-freetext" href="mailto:news3@newsaccess.in"
            moz-do-not-send="true">news3@newsaccess.in</a>') from
          mail_vessel_addressbook where text('<a
            class="moz-txt-link-abbreviated moz-txt-link-freetext"
            href="mailto:news3@newsaccess.in" moz-do-not-send="true">news3@newsaccess.in</a>')
          ~*
          replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')
           LIMIT 1;<br>
        </span></p>
      <p><span style="font-family:monospace">It basically seq scans both
          tables, and the 2nd table in particular is fairly big : </span><span
          style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">123966 rows.
            So the above query tends to take 3.6 secs on average to
            execute. A pg_trgm index would not help since there is no
            way (that I know of) to index a regex in pgsql. Most common
            use is about indexing values to be checked against a given
            regex, quite the reverse from what we need here. <br>
          </span></span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">I got great
            improvement by materializing </span></span><span
          style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')</span></span>,
        <span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">we are
            running pgsql 10.23 :( , no generated columns possible, so I
            used a trigger, and then adding a normal b-tree index on the
            column. This causes Index only scan which is much better
            than the seq scan. Also I had to re-write the query as : <br>
          </span></span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">with <br>
          </span></span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">qry1 as <br>
          </span></span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">    (SELECT
            text('<a class="moz-txt-link-abbreviated
              moz-txt-link-freetext"
              href="mailto:watchkeepers@ukmto.org"
              moz-do-not-send="true">watchkeepers@ukmto.org</a>') from
            mail_vessel_addressbook where <br>
          </span></span></p>
      <p><span style="font-family:monospace"><span
            style="color:#000000;background-color:#ffffff;">    text('<a
              class="moz-txt-link-abbreviated moz-txt-link-freetext"
              href="mailto:watchkeepers@ukmto.org"
              moz-do-not-send="true">watchkeepers@ukmto.org</a>') ~*
            replace(replace(replace(address,'*','.*'),'+','.+'),'?','</span>.?')</span></p>
      <p><span style="font-family:monospace">    LIMIT 1), <br>
          qry2 as <br>
        </span></p>
      <p><span style="font-family:monospace">    (SELECT text('<a
            class="moz-txt-link-abbreviated moz-txt-link-freetext"
            href="mailto:watchkeepers@ukmto.org" moz-do-not-send="true">watchkeepers@ukmto.org</a>')
          from mail_global_addressbook where</span></p>
      <p><span style="font-family:monospace">     text('<a
            class="moz-txt-link-abbreviated moz-txt-link-freetext"
            href="mailto:watchkeepers@ukmto.org" moz-do-not-send="true">watchkeepers@ukmto.org</a>')
          ~*
          replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?') 
          <br>
        </span></p>
      <p><span style="font-family:monospace">    LIMIT 1) <br>
          select text('<a class="moz-txt-link-abbreviated
            moz-txt-link-freetext" href="mailto:watchkeepers@ukmto.org"
            moz-do-not-send="true">watchkeepers@ukmto.org</a>') from
          qry1 UNION select text('<a class="moz-txt-link-abbreviated
            moz-txt-link-freetext" href="mailto:watchkeepers@ukmto.org"
            moz-do-not-send="true">watchkeepers@ukmto.org</a>') from
          qry2 LIMIT 1;</span></p>
      <p><span style="font-family:monospace">I was wondering if there is
          some more elegant and better solution, without the need for
          extra column and trigger.</span></p>
    </blockquote>
    <p><br>
    </p>
    <p>Answering this myself, the way I dealt with it was just create an
      extra normal index (btree) on the domain part, and include an
      extra equality check for the domain. This boosted performance from
      2.5 secs to less than 4ms.<br>
    </p>
    <p><br>
    </p>
    <pre class="moz-signature" cols="72">-- 
Achilleas Mantzios
 IT DEV - HEAD
 IT DEPT
 Dynacom Tankers Mgmt</pre>
  </body>
</html>

--------------LVCF8plaxL1hdnZdhMF6F2qq--