To: pgsql-performance@postgresql.org
From: Volker Boehm <volker@vboehm.de>
Subject: similarity and operator '%'
Message-ID: <574C7E37.7080102@vboehm.de>
Date: Mon, 30 May 2016 19:53:59 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Icedove/38.6.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

Hello,
I'm trying to find persons in an address database where I have built 
trgm-indexes on name, street, zip and city.

When I search for all four parts of the address (name, street, zip and city)

     select name, street, zip, city
     from addresses
     where name % $1
       and street % $2
       and (zip % $3 or city % $4)

everything works fine: It takes less than a second to get some (5 - 500) 
proposed addresses out of 500,000 addresses and the query plan shows

     Bitmap Heap Scan on addresses  (cost=168.31..1993.38 rows=524 ...
       Recheck Cond: ...
       ->  Bitmap Index Scan on ...
             Index Cond: ...

The same happens when I search only by name with

     select name, street, zip, city
     from addresses
     where name % $1

But when I rewrite this query to

     select name, street, zip, city
     from addresses
     where similarity(name, $1) > 0.3

which means exactly then same as the second example, the query plan 
changes to

     Seq Scan on addresses  (cost=0.00..149714.42 rows=174675 width=60)
       Filter: ...

and the query lasts about a minute.

The reason for using the similarity function in place of the 
'%'-operator is that I want to use different similarity values in one query:

     select name, street, zip, city
     from addresses
     where name % $1
         and street % $2
         and (zip % $3 or city % $4)
         or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have 
little similarity _plus_ all addresses where the name matches very good.


The only way I found, was to create a temporary table from the first 
query, change the similarity value with set_limit() and then select the 
second query UNION the temporary table.

Is there a more elegant and straight forward way to achieve this result?

regards  Volker

-- 
Volker Böhm        Tel.: +49 4141 981155
Voßkuhl 5          mailto:volker@vboehm.de
21682 Stade        http://www.vboehm.de


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance