Subject: Re: Mailing list search engine: surprising missing results?
To: pgsql-www@lists.postgresql.org
References: 
 <CALDQ5NxzgeXHRCD4dS_6qz+nn01ivi3i1ZEtD2DmC779i0=iSQ@mail.gmail.com>
 <ab4184b7ab84623be10c4676e090cc27ae78b355.camel@cybertec.at>
 <Ye79wNIXsyhwwwce@momjian.us> <2150096.1643057249@sss.pgh.pa.us>
 <CAF4Au4yttKJ1KAP-cO+HMLQ2_66vmx0dLTBUbE4W8Aa64foafg@mail.gmail.com>
 <22d5245c9c5a9aa05a0510bdd52458812140a870.camel@cybertec.at>
 <2257661.1643127753@sss.pgh.pa.us>
From: Ivan Panchenko <i.panchenko@postgrespro.ru>
Message-ID: <79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru>
Date: Tue, 25 Jan 2022 20:02:36 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.14.0
MIME-Version: 1.0
In-Reply-To: <2257661.1643127753@sss.pgh.pa.us>
Content-Type: multipart/alternative;
 boundary="------------78BD97C9DDB55037109850E9"
Content-Language: en-US
Archived-At: 
 <https://www.postgresql.org/message-id/79b3eb6e-152e-3c56-7b71-51d091c0f6d9%40postgrespro.ru>
Precedence: bulk

This is a multi-part message in MIME format.
--------------78BD97C9DDB55037109850E9
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit


On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Bruce Momjian <bruce@momjian.us> writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite.  The problem in question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                         to_tsvector
> ----------------------------------------------------------
>   'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
>              to_tsvector
> -----------------------------------
>   'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
>                          to_tsvector
> -----------------------------------------------------------
>   'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already.  So I think the original explanation here is wrong and
> we need to probe more closely.

The actual explanation can be seen by comparing the tsvector with the tsquery.
To rule out stemming effects, the examples below use the simple configuration.

# select plainto_tsquery('simple','boyers-moore');

            plainto_tsquery
-------------------------------------
  'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');

                          to_tsvector
-------------------------------------------------------------
  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

Obviously, this tsvector does not match the above tsquery: the query requires the lexeme 'boyers-moore', which the tsvector does not contain. I think a better tsquery for this query would be

  'boyers-moore' | ('boyers' & 'moore')

Maybe it is worth changing the to_tsquery() behavior for such cases.
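
A quick way to check both claims is sketched below. It assumes the same
simple configuration as above; the ::tsquery cast is only there so that the
proposed query is taken literally, without going through the parser.

-- the query produced by plainto_tsquery() against the longer compound
select to_tsvector('simple','boyers-moore-horspool')
       @@ plainto_tsquery('simple','boyers-moore');

-- the proposed alternative query against the same tsvector
select to_tsvector('simple','boyers-moore-horspool')
       @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery;

Given the outputs shown above, the first should return false (no
'boyers-moore' lexeme in the tsvector) and the second true (both 'boyers'
and 'moore' are present).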


>
> 			regards, tom lane
>
>
Regards,
Ivan


--------------78BD97C9DDB55037109850E9--