From: Tom Lane <[email protected]>
To: Laurenz Albe <[email protected]>
Cc: Oleg Bartunov <[email protected]>
Cc: Bruce Momjian <[email protected]>
Cc: James Addison <[email protected]>
Cc: PostgreSQL WWW <[email protected]>
Subject: Re: Mailing list search engine: surprising missing results?
Date: Tue, 25 Jan 2022 11:22:33 -0500
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <CALDQ5NxzgeXHRCD4dS_6qz+nn01ivi3i1ZEtD2DmC779i0=iSQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAF4Au4yttKJ1KAP-cO+HMLQ2_66vmx0dLTBUbE4W8Aa64foafg@mail.gmail.com>
	<[email protected]>

Laurenz Albe <[email protected]> writes:
> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:
>>> Bruce Momjian <[email protected]> writes:
>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>> isn't:

> Not quite.  The problem in question is the "'boyer-moore':1".
> If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector                        
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector            
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector                        
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.
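
One possible way to probe further, offered only as a sketch: ts_debug()
reports each token the parser emits for a hyphenated word, together with
the dictionary consulted and the lexemes produced, and plainto_tsquery()
shows what the query side is reduced to before matching.  The 'english'
configuration and the use of plainto_tsquery() are assumptions here; the
list search engine may use a different configuration or query function.

regression=# SELECT alias, token, dictionary, lexemes
regression-#   FROM ts_debug('english', 'Boyer-Moore');

regression=# SELECT alias, token, dictionary, lexemes
regression-#   FROM ts_debug('english', 'Boyer-Moore-Horspool');

regression=# SELECT plainto_tsquery('english', 'Boyer-Moore');

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
regression-#        @@ plainto_tsquery('english', 'Boyer-Moore');

Comparing the lexemes of the whole-word token in the first two results
against the terms in the tsquery should show whether the mismatch comes
from the compound token rather than from its component words.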

			regards, tom lane




