Subject: Re: Mailing list search engine: surprising missing results?
To: pgsql-www@lists.postgresql.org
From: Ivan Panchenko
Date: Tue, 25 Jan 2022 20:02:36 +0300
Message-ID: <79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru>
In-Reply-To: <2257661.1643127753@sss.pgh.pa.us>
References: <2150096.1643057249@sss.pgh.pa.us> <22d5245c9c5a9aa05a0510bdd52458812140a870.camel@cybertec.at> <2257661.1643127753@sss.pgh.pa.us>


On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Bruce Momjian <bruce@momjian.us> writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite.  The problem is question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                        to_tsvector                        
> ----------------------------------------------------------
>  'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
>             to_tsvector            
> -----------------------------------
>  'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
>                         to_tsvector                        
> -----------------------------------------------------------
>  'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already.  So I think the original explanation here is wrong and
> we need to probe more closely.

The actual explanation can be seen from comparing a tsvector with a tsquery.
To avoid stemming effects, we use the simple configuration below.

# select plainto_tsquery('simple','boyers-moore');
           plainto_tsquery           
-------------------------------------
 'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');
                         to_tsvector                        
-------------------------------------------------------------
 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

Obviously, such a tsvector does not match the above tsquery: it contains
no 'boyers-moore' lexeme, so the first operand of the AND fails.
I think a better tsquery for this query would be
 'boyers-moore' | ('boyers' & 'moore')
Maybe it is worth changing the to_tsquery() behavior for such cases.
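
The difference between the two query forms can be checked directly with the @@ operator. This is only a sketch: the tsvector and tsquery literals are copied from the outputs above and cast as-is, so no parser or dictionary is involved, and the column aliases conjunctive_form and proposed_form are names made up for this example.

```sql
-- Literals are taken from the outputs shown above; the aliases are illustrative.
WITH doc AS (
    SELECT $$'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3$$::tsvector AS v
)
SELECT v @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery   AS conjunctive_form,
       v @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery AS proposed_form
FROM doc;

--  conjunctive_form | proposed_form
-- ------------------+---------------
--  f                | t
```

The conjunctive form fails only because of its 'boyers-moore' operand; the OR form still matches a document that contains the exact compound lexeme, so it should not lose any results that the current form finds.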


> 			regards, tom lane


Regards,
Ivan