From: Ivan Panchenko <[email protected]>
To: [email protected]
Subject: Re: Mailing list search engine: surprising missing results?
Date: Tue, 25 Jan 2022 20:02:36 +0300
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <CALDQ5NxzgeXHRCD4dS_6qz+nn01ivi3i1ZEtD2DmC779i0=iSQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAF4Au4yttKJ1KAP-cO+HMLQ2_66vmx0dLTBUbE4W8Aa64foafg@mail.gmail.com>
	<[email protected]>
	<[email protected]>


On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe <[email protected]> writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:
>>>> Bruce Momjian <[email protected]> writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite.  The problem in question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                         to_tsvector
> ----------------------------------------------------------
>   'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
>              to_tsvector
> -----------------------------------
>   'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
>                          to_tsvector
> -----------------------------------------------------------
>   'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already.  So I think the original explanation here is wrong and
> we need to probe more closely.

The actual explanation can be seen from comparing a tsvector with a tsquery.
To avoid stemming effects, we use the simple configuration below.

# select plainto_tsquery('simple','boyers-moore');

            plainto_tsquery
-------------------------------------
  'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');

                          to_tsvector
-------------------------------------------------------------
  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

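Obviously, such a tsvector does not match the above tsquery: the vector
contains 'boyers' and 'moore', but not the compound lexeme 'boyers-moore',
while the query requires all three. I think a better tsquery for this
query would be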

  'boyers-moore' | ('boyers' & 'moore')

Maybe it is worth changing the behavior of to_tsquery() for such cases.
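
To make the mismatch concrete, both query shapes can be checked against
the vector with the @@ match operator. This is only a sketch: the tsquery
values are written as literals (dollar-quoted to avoid doubling the single
quotes) instead of being produced by plainto_tsquery(), so the first one
mirrors the output shown above and not necessarily what a given server
version generates.

```sql
-- The vector from the example above has 'boyers' and 'moore',
-- but no 'boyers-moore' lexeme.

-- Current query shape: all three lexemes are required.
select to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery
       as current_form;

-- Suggested query shape: either the compound lexeme or both of its parts.
select to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery
       as suggested_form;
```

Given the to_tsvector() output shown above, the first query should return
false and the second true.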


>
> 			regards, tom lane
>
>
Regards,
Ivan


