Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qh6Rb-001FmZ-Bh for pgsql-sql@arkaria.postgresql.org; Fri, 15 Sep 2023 10:53:03 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1qh6RZ-00GNEd-2x for pgsql-sql@arkaria.postgresql.org; Fri, 15 Sep 2023 10:53:01 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qh6RY-00GNDm-Kk for pgsql-sql@lists.postgresql.org; Fri, 15 Sep 2023 10:53:00 +0000 Received: from cloud.gatewaynet.com ([185.90.37.94]) by makus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qh6RV-004hrg-23 for pgsql-sql@postgresql.org; Fri, 15 Sep 2023 10:52:59 +0000 Content-Type: multipart/alternative; boundary="------------nSSYvxyKAtXWehKB1nOQIuS3" Message-ID: <351abe02-57a5-a219-a789-57f3b6079f73@cloud.gatewaynet.com> Date: Fri, 15 Sep 2023 13:52:52 +0300 MIME-Version: 1.0 To: pgsql-sql@postgresql.org Content-Language: en-US From: Achilleas Mantzios - cloud Subject: Regex matching where text is input and regex stored in column List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------nSSYvxyKAtXWehKB1nOQIuS3 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Hello I have a table like : amantzio@sma/dynacom=# \d mail_vessel_addressbook                                    Table "public.mail_vessel_addressbook" Column  |  Type   | Collation | Nullable |                              Default ---------+---------+-----------+----------+-------------------------------------------------------------------- id      | integer |           | not null | nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass) name    | text    |           | not null | address | text    |           | not null | vslid   | integer |           | not null | Indexes: ..... where address stores basically email patterns, human / edited text with possible wild cards, meant to be input mainly by our admins. There is an query that checks a particular new address (this runs via an exim4 mail server) which has become quite slow : SELECT text('news3@newsaccess.in') from mail_global_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE CT text('news3@newsaccess.in') from mail_vessel_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1; It basically seq scans both tables, and the 2nd table in particular is fairly big : 123966 rows. So the above query tends to take 3.6 secs on average to execute. A pg_trgm index would not help since there is no way (that I know of) to index a regex in pgsql. Most common use is about indexing values to be checked against a given regex, quite the reverse from what we need here. I got great improvement by materializing replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are running pgsql 10.23 :( , no generated columns possible, so I used a trigger, and then adding a normal b-tree index on the column. This causes Index only scan which is much better than the seq scan. Also I had to re-write the query as : with qry1 as     (SELECT text('watchkeepers@ukmto.org') from mail_vessel_addressbook where text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')     LIMIT 1), qry2 as     (SELECT text('watchkeepers@ukmto.org') from mail_global_addressbook where text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')     LIMIT 1) select text('watchkeepers@ukmto.org') from qry1 UNION select text('watchkeepers@ukmto.org') from qry2 LIMIT 1; I was wondering if there is some more elegant and better solution, without the need for extra column and trigger. --------------nSSYvxyKAtXWehKB1nOQIuS3 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit

Hello

I have a table like :

amantzio@sma/dynacom=# \d mail_vessel_addressbook
                                   Table "public.mail_vessel_addressbook"
Column  |  Type   | Collation | Nullable |                              Default                                
---------+---------+-----------+----------+--------------------------------------------------------------------
id      | integer |           | not null | nextval(('public.mail_vessel_addressbook_id_seq'::text)::regclass)
name    | text    |           | not null |  
address | text    |           | not null |  
vslid   | integer |           | not null |  
Indexes:

.....

where address stores basically email patterns, human / edited text with possible wild cards, meant to be input mainly by our admins. There is an query that checks a particular new address (this runs via an exim4 mail server) which has become quite slow :

SELECT text('news3@newsaccess.in') from mail_global_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  UNION SELE
CT text('news3@newsaccess.in') from mail_vessel_addressbook where text('news3@newsaccess.in') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')  LIMIT 1;

It basically seq scans both tables, and the 2nd table in particular is fairly big : 123966 rows. So the above query tends to take 3.6 secs on average to execute. A pg_trgm index would not help since there is no way (that I know of) to index a regex in pgsql. Most common use is about indexing values to be checked against a given regex, quite the reverse from what we need here.

I got great improvement by materializing replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?'), we are running pgsql 10.23 :( , no generated columns possible, so I used a trigger, and then adding a normal b-tree index on the column. This causes Index only scan which is much better than the seq scan. Also I had to re-write the query as :

with

qry1 as

    (SELECT text('watchkeepers@ukmto.org') from mail_vessel_addressbook where

    text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?')

    LIMIT 1),
qry2 as

    (SELECT text('watchkeepers@ukmto.org') from mail_global_addressbook where

     text('watchkeepers@ukmto.org') ~* replace(replace(replace(address,'*','.*'),'+','.+'),'?','.?') 

    LIMIT 1)
select text('watchkeepers@ukmto.org') from qry1 UNION select text('watchkeepers@ukmto.org') from qry2 LIMIT 1;

I was wondering if there is some more elegant and better solution, without the need for extra column and trigger.


--------------nSSYvxyKAtXWehKB1nOQIuS3--