Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qhYGQ-003Rz0-SP for pgsql-sql@arkaria.postgresql.org; Sat, 16 Sep 2023 16:35:23 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1qhYGN-00G4oY-Qr for pgsql-sql@arkaria.postgresql.org; Sat, 16 Sep 2023 16:35:19 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qhYGN-00G4oD-4A for pgsql-sql@lists.postgresql.org; Sat, 16 Sep 2023 16:35:19 +0000 Received: from cloud.gatewaynet.com ([185.90.37.94]) by makus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qhYGF-004uzk-7D for pgsql-sql@lists.postgresql.org; Sat, 16 Sep 2023 16:35:17 +0000 Content-Type: multipart/alternative; boundary="------------LVCF8plaxL1hdnZdhMF6F2qq" Message-ID: Date: Sat, 16 Sep 2023 19:35:05 +0300 MIME-Version: 1.0 Subject: Re: Regex matching where text is input and regex stored in column Content-Language: en-US To: pgsql-sql@lists.postgresql.org References: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com> From: Achilleas Mantzios In-Reply-To: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------LVCF8plaxL1hdnZdhMF6F2qq Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Στις 15/9/23 13:52, ο/η Achilleas Mantzios - cloud έγραψε: > > Hello > > I have a table like : > > amantzio@sma/dynacom=# \d mail_vessel_addressbook >                                    Table "public.mail_vessel_addressbook" > Column  |  Type   | Collation | Nullable | >                              Default > ---------+---------+-----------+----------+-------------------------------------------------------------------- > > id      | integer |           | not null | > nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass) > name    | text    |           | not null | > address | text    |           | not null | > vslid   | integer |           | not null | > Indexes: > > ..... > > where address stores basically email patterns, human / edited text > with possible wild cards, meant to be input mainly by our admins. > There is an query that checks a particular new address (this runs via > an exim4 mail server) which has become quite slow : > > SELECT text('news3@newsaccess.in') from mail_global_addressbook where > text('news3@newsaccess.in') ~* > replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE > CT text('news3@newsaccess.in') from mail_vessel_addressbook where > text('news3@newsaccess.in') ~* > replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1; > > It basically seq scans both tables, and the 2nd table in particular is > fairly big : 123966 rows. So the above query tends to take 3.6 secs on > average to execute. A pg_trgm index would not help since there is no > way (that I know of) to index a regex in pgsql. Most common use is > about indexing values to be checked against a given regex, quite the > reverse from what we need here. > > I got great improvement by materializing > replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are > running pgsql 10.23 :( , no generated columns possible, so I used a > trigger, and then adding a normal b-tree index on the column. This > causes Index only scan which is much better than the seq scan. Also I > had to re-write the query as : > > with > > qry1 as > >     (SELECT text('watchkeepers@ukmto.org') from > mail_vessel_addressbook where > >     text('watchkeepers@ukmto.org') ~* > replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?') > >     LIMIT 1), > qry2 as > >     (SELECT text('watchkeepers@ukmto.org') from > mail_global_addressbook where > >      text('watchkeepers@ukmto.org') ~* > replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?') > >     LIMIT 1) > select text('watchkeepers@ukmto.org') from qry1 UNION select > text('watchkeepers@ukmto.org') from qry2 LIMIT 1; > > I was wondering if there is some more elegant and better solution, > without the need for extra column and trigger. > Answering this myself, the way I dealt with it was just create an extra normal index (btree) on the domain part, and include an extra equality check for the domain. This boosted performance from 2.5 secs to less than 4ms. -- Achilleas Mantzios IT DEV - HEAD IT DEPT Dynacom Tankers Mgmt --------------LVCF8plaxL1hdnZdhMF6F2qq Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit
Στις 15/9/23 13:52, ο/η Achilleas Mantzios - cloud έγραψε:

Hello

I have a table like :

amantzio@sma/dynacom=# \d mail_vessel_addressbook
                                   Table "public.mail_vessel_addressbook"
Column  |  Type   | Collation | Nullable |                              Default                                
---------+---------+-----------+----------+--------------------------------------------------------------------
id      | integer |           | not null | nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
name    | text    |           | not null |  
address | text    |           | not null |  
vslid   | integer |           | not null |  
Indexes:

.....

where address stores basically email patterns, human / edited text with possible wild cards, meant to be input mainly by our admins. There is an query that checks a particular new address (this runs via an exim4 mail server) which has become quite slow :

SELECT text('news3@newsaccess.in') from mail_global_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
CT text('news3@newsaccess.in') from mail_vessel_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;

It basically seq scans both tables, and the 2nd table in particular is fairly big : 123966 rows. So the above query tends to take 3.6 secs on average to execute. A pg_trgm index would not help since there is no way (that I know of) to index a regex in pgsql. Most common use is about indexing values to be checked against a given regex, quite the reverse from what we need here.

I got great improvement by materializing replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are running pgsql 10.23 :( , no generated columns possible, so I used a trigger, and then adding a normal b-tree index on the column. This causes Index only scan which is much better than the seq scan. Also I had to re-write the query as :

with

qry1 as

    (SELECT text('watchkeepers@ukmto.org') from mail_vessel_addressbook where

    text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

    LIMIT 1),
qry2 as

    (SELECT text('watchkeepers@ukmto.org') from mail_global_addressbook where

     text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?') 

    LIMIT 1)
select text('watchkeepers@ukmto.org') from qry1 UNION select text('watchkeepers@ukmto.org') from qry2 LIMIT 1;

I was wondering if there is some more elegant and better solution, without the need for extra column and trigger.


Answering this myself, the way I dealt with it was just create an extra normal index (btree) on the domain part, and include an extra equality check for the domain. This boosted performance from 2.5 secs to less than 4ms.


-- 
Achilleas Mantzios
 IT DEV - HEAD
 IT DEPT
 Dynacom Tankers Mgmt
--------------LVCF8plaxL1hdnZdhMF6F2qq--