Mailing list search engine: surprising missing results?

public inbox for [email protected]  
13+ messages / 6 participants

* Mailing list search engine: surprising missing results?
@ 2022-01-23 12:49  James Addison <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: James Addison @ 2022-01-23 12:49 UTC (permalink / raw)
  To: pgsql-www

Hello,

I noticed that the mailing list search engine[1] seems to unexpectedly
miss results for some queries.

For example:

A search for "boyer"[2] returns five results, including result
snippets that contain the text "Boyer-More-Horspool" [sic] and
"Boyer-Moore-Horspool".

However, a more specific search for "boyer-moore"[3] does not return
any results -- that seems surprising.

Specializing the query further and searching for
"boyer-moore-horspool"[4] *does* again return results -- two documents
-- with the terms "boyer" and "horspool" highlighted.

Although it's not a significant problem, I do have a theory that could
explain the behaviour (offered in case it may save time on
investigation):

It seems possible that the term "more" -- and nearby misspellings,
like "moore" -- may be filtered out as stopwords (meaning: they're not
present in the search index), and that the search engine is configured
to require a minimum percentage match rate for query terms.

Under those conditions: searches for "boyer" would produce a 100%
match rate, "boyer-moore" would produce 50% (since "moore" would not
be found in the term index), and "boyer-moore-horspool" would match at
66-point-6-repeating percent.

Given a required match rate of around two thirds, that could explain
the behaviour (it might not be the true reason, but it seems like one
possibility).
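
To make the arithmetic of this theory concrete, here is a minimal Python sketch. The stopword set and the 0.6 threshold are assumptions chosen for illustration (the message only says "around two thirds"), and splitting the query on hyphens is a simplification. None of this reflects the search engine's confirmed configuration.

```python
# Hypothetical values: neither the stopword list nor the threshold is
# known to be what the real search engine uses.
STOPWORDS = {"more", "moore"}
THRESHOLD = 0.6


def match_rate(query_terms, stopwords):
    """Fraction of query terms that would still be present in the index."""
    indexed = [term for term in query_terms if term not in stopwords]
    return len(indexed) / len(query_terms)


def would_return_results(query):
    """Apply the hypothesised minimum-match rule to a hyphenated query."""
    terms = query.lower().split("-")
    return match_rate(terms, STOPWORDS) >= THRESHOLD


for q in ("boyer", "boyer-moore", "boyer-moore-horspool"):
    print(q, would_return_results(q))
```

Under these assumed values the three queries reproduce the observed pattern: results, no results, results.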

Thanks,
James

[1] https://www.postgresql.org/search/
[2] https://www.postgresql.org/search/?m=1&q=boyer&l=1&d=365&s=r
[3] https://www.postgresql.org/search/?m=1&q=boyer-moore&l=1&d=365&s=r
[4] https://www.postgresql.org/search/?m=1&q=boyer-moore-horspool&l=1&d=365&s=r

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-24 07:27  Laurenz Albe <[email protected]>
  parent: James Addison <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Laurenz Albe @ 2022-01-24 07:27 UTC (permalink / raw)
  To: James Addison <[email protected]>; pgsql-www

On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> Hello,
> 
> I noticed that the mailing list search engine[1] seems to unexpectedly
> miss results for some queries.
> 
> For example:
> 
> A search for "boyer"[2] returns five results, including result
> snippets that contain the text "Boyer-More-Horspool" [sic] and
> "Boyer-Moore-Horspool".
> 
> However, a more specific search for "boyer-moore"[3] does not return
> any results -- that seems surprising.
> 
> Specializing the query further and searching for
> "boyer-moore-horspool"[4] *does* again return results -- two documents
> -- with the terms "boyer" and "horspool" highlighted.

This is caused by the peculiarities of PostgreSQL full text search:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
       @@ websearch_to_tsquery('english', 'boyer-moore');

 ?column?
══════════
 f
(1 row)

The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
isn't:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');

                       to_tsvector
══════════════════════════════════════════════════════════
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

SELECT websearch_to_tsquery('english', 'boyer-moore');

         websearch_to_tsquery
═════════════════════════════════════
 'boyer-moor' <-> 'boyer' <-> 'moor'
(1 row)

'boyer-moor' is not present in the first result.

As a workaround, I suggest that you search for 'boyer moore'
or (even better) '"boyer moore"' (with the double quotes):

SELECT websearch_to_tsquery('english', 'boyer moore');

 websearch_to_tsquery
══════════════════════
 'boyer' & 'moor'
(1 row)

SELECT websearch_to_tsquery('english', '"boyer moore"');

 websearch_to_tsquery
══════════════════════
 'boyer' <-> 'moor'
(1 row)

Yours,
Laurenz Albe

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-24 19:28  Bruce Momjian <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Bruce Momjian @ 2022-01-24 19:28 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: James Addison <[email protected]>; pgsql-www

On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> > Specializing the query further and searching for
> > "boyer-moore-horspool"[4] *does* again return results -- two documents
> > -- with the terms "boyer" and "horspool" highlighted.
> 
> This is caused by the peculiarities of PostgreSQL full text search:
> 
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
>        @@ websearch_to_tsquery('english', 'boyer-moore');
> 
>  ?column?
> ══════════
>  f
> (1 row)
> 
> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> isn't:

Wow, he showed me this problem earlier but I never suspected it was
a stemming issue because I never considered proper nouns could be
stem-adjusted, but it is obvious they can.

-- 
  Bruce Momjian  <[email protected]>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-24 20:47  Tom Lane <[email protected]>
  parent: Bruce Momjian <[email protected]>
  0 siblings, 2 replies; 13+ messages in thread

From: Tom Lane @ 2022-01-24 20:47 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: Laurenz Albe <[email protected]>; James Addison <[email protected]>; pgsql-www

Bruce Momjian <[email protected]> writes:
> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>> isn't:

> Wow, he showed me this problem earlier but I never suspected it was
> a stemming issue because I never considered proper nouns could be
> stem-adjusted, but it is obvious they can.

I wonder if we should change that so that components of a compound
word are consistently stemmed the same way.

			regards, tom lane

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-24 22:15  Bruce Momjian <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 0 replies; 13+ messages in thread

From: Bruce Momjian @ 2022-01-24 22:15 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Laurenz Albe <[email protected]>; James Addison <[email protected]>; pgsql-www

On Mon, Jan 24, 2022 at 03:47:29PM -0500, Tom Lane wrote:
> Bruce Momjian <[email protected]> writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
> 
> > Wow, he showed me this problem earlier but I never suspected it was
> > a stemming issue because I never considered proper nouns could be
> > stem-adjusted, but it is obvious they can.
> 
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.

I don't see the value in a change --- it might break the same number of
cases it fixes.

-- 
  Bruce Momjian  <[email protected]>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 11:04  Oleg Bartunov <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Oleg Bartunov @ 2022-01-25 11:04 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Bruce Momjian <[email protected]>; Laurenz Albe <[email protected]>; James Addison <[email protected]>; pgsql-www

On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:

> Bruce Momjian <[email protected]> writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
>
> > Wow, he showed me this problem earlier but I never suspected it was
> > a stemming issue because I never considered proper nouns could be
> > stem-adjusted, but it is obvious they can.
>
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.
>


Something like this

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)






>
>                         regards, tom lane
>
>
>

-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 12:43  Laurenz Albe <[email protected]>
  parent: Oleg Bartunov <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Laurenz Albe @ 2022-01-25 12:43 UTC (permalink / raw)
  To: Oleg Bartunov <[email protected]>; Tom Lane <[email protected]>; +Cc: Bruce Momjian <[email protected]>; James Addison <[email protected]>; pgsql-www

On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:
> > Bruce Momjian <[email protected]> writes:
> > > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> > > > The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> > > > is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> > > > isn't:
> > 
> > > Wow, he showed me this problem earlier but I never suspected it was
> > > a stemming issue because I never considered proper nouns could be
> > > stem-adjusted, but it is obvious they can.
> > 
> > I wonder if we should change that so that components of a compound
> > word are consistently stemmed the same way.
>
> Something like this
> 
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                        to_tsvector
> ----------------------------------------------------------
>  'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1  'moore-horspool':1  'horspool':4 'moor':3
> (1 row)

Not quite.  The problem in question is the "'boyer-moore':1".
If that were "'boyer-moor':1" instead, the problem would disappear.

Yours,
Laurenz Albe

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 16:22  Tom Lane <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Tom Lane @ 2022-01-25 16:22 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Oleg Bartunov <[email protected]>; Bruce Momjian <[email protected]>; James Addison <[email protected]>; pgsql-www

Laurenz Albe <[email protected]> writes:
> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:
>>> Bruce Momjian <[email protected]> writes:
>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>> isn't:

> Not quite.  The problem in question is the "'boyer-moore':1".
> If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector                        
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector            
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector                        
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.

			regards, tom lane

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 17:02  Ivan Panchenko <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Ivan Panchenko @ 2022-01-25 17:02 UTC (permalink / raw)
  To: [email protected]


On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe <[email protected]> writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <[email protected]> wrote:
>>>> Bruce Momjian <[email protected]> writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite.  The problem in question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                         to_tsvector
> ----------------------------------------------------------
>   'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
>              to_tsvector
> -----------------------------------
>   'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
>                          to_tsvector
> -----------------------------------------------------------
>   'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already.  So I think the original explanation here is wrong and
> we need to probe more closely.

The actual explanation can be seen from comparing a tsvector with a tsquery.
To avoid stemming effects, we use the simple configuration below.

# select plainto_tsquery('simple','boyers-moore');

            plainto_tsquery
-------------------------------------
  'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');

                          to_tsvector
-------------------------------------------------------------
  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

Obviously, such a tsvector does not match the above tsquery. I think a
better tsquery for this query would be

  'boyers-moore' | ('boyers' & 'moore')

Maybe it is worth changing the to_tsquery() behavior for such cases.
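
As a rough illustration of the mismatch, the following Python sketch treats the tsvector as a plain set of lexemes and evaluates both query shapes against it. It is a simplified model only: positions and weights are ignored, and the function names are made up for this example.

```python
# Lexemes from to_tsvector('simple', 'boyers-moore-horspool'), as shown above.
doc_lexemes = {"boyers", "boyers-moore-horspool", "horspool", "moore"}


def current_query(lexemes):
    """Models 'boyers-moore' & 'boyers' & 'moore': every term must be present."""
    return all(term in lexemes for term in ("boyers-moore", "boyers", "moore"))


def proposed_query(lexemes):
    """Models 'boyers-moore' | ('boyers' & 'moore')."""
    return "boyers-moore" in lexemes or (
        "boyers" in lexemes and "moore" in lexemes
    )


print(current_query(doc_lexemes))   # the compound token is missing, so no match
print(proposed_query(doc_lexemes))  # the component words alone are enough
```

In this model the current form fails only because the document contains the three-part compound 'boyers-moore-horspool' rather than 'boyers-moore', while the OR form still matches on the component words.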


>
> 			regards, tom lane
>
>
Regards,
Ivan

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 17:54  Tom Lane <[email protected]>
  parent: Ivan Panchenko <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Tom Lane @ 2022-01-25 17:54 UTC (permalink / raw)
  To: Ivan Panchenko <[email protected]>; +Cc: [email protected]

Ivan Panchenko <[email protected]> writes:
> The actual explanation can be seen from comparing a tsvector with a tsquery.
> To avoid stemming effects, we use the simple configuration below.

> # select plainto_tsquery('simple','boyers-moore');

>             plainto_tsquery
> -------------------------------------
>   'boyers-moore' & 'boyers' & 'moore'

> # select to_tsvector('simple','boyers-moore-horspool');

>                           to_tsvector
> -------------------------------------------------------------
>   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

> Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be

>   'boyers-moore' | ('boyers' & 'moore')

> Maybe it is worth changing the to_tsquery() behavior for such cases.

Changing the behavior of to_tsquery is certainly a lot less scary
than changing to_tsvector --- it wouldn't call the validity of
existing tsvector indexes into question.

I see that to_tsquery is even sillier than plainto_tsquery:

regression=# select to_tsquery('simple','boyers-moore');
               to_tsquery                
-----------------------------------------
 'boyers-moore' <-> 'boyers' <-> 'moore'
(1 row)

which is absolutely not a sane translation.

It seems to me that in both cases we'd be better off generating
"'boyers' <-> 'moore'", without the compound token at all.
Maybe there's a case for the weaker 'boyers' & 'moore' translation,
but I think if people wanted that they'd just enter separate words.
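
To illustrate why the chained phrase form cannot match the longer compound while the two-word phrase can, here is a small Python model of the `<->` operator over lexeme positions. It is a sketch under simplifying assumptions (only adjacent-position chains, no weights, no `<N>` distances), and `phrase_match` is a name invented for this example.

```python
def phrase_match(vector, terms):
    """True if the terms occur at consecutive positions (chained <-> model)."""
    first, *rest = terms
    for start in vector.get(first, ()):
        if all(
            start + offset in vector.get(term, ())
            for offset, term in enumerate(rest, start=1)
        ):
            return True
    return False


# Positions from to_tsvector('simple', 'boyers-moore-horspool') quoted above.
vector = {
    "boyers": {2},
    "boyers-moore-horspool": {1},
    "horspool": {4},
    "moore": {3},
}

# 'boyers-moore' <-> 'boyers' <-> 'moore': the compound lexeme is absent.
print(phrase_match(vector, ["boyers-moore", "boyers", "moore"]))
# 'boyers' <-> 'moore': positions 2 and 3 are adjacent.
print(phrase_match(vector, ["boyers", "moore"]))
```

In this model the first query fails at its first term, whereas the compound-free phrase succeeds because the component words are stored at adjacent positions.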

			regards, tom lane

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 20:48  James Addison <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: James Addison @ 2022-01-25 20:48 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Ivan Panchenko <[email protected]>; [email protected]

I'm uncertain why parsing hyphenated query text produces compound tokens?

There are a couple of references[1][2] in the documentation about the
dash character being converted to a boolean not (!) operator by
websearch_to_tsquery, but that seems unrelated.

postgres=# select plainto_tsquery('simple', 'a-b');
  plainto_tsquery
-------------------
 'a-b' & 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a_b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a+b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

[1] - https://www.postgresql.org/docs/14/functions-textsearch.html
[2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

On Tue, 25 Jan 2022 at 17:54, Tom Lane <[email protected]> wrote:
>
> Ivan Panchenko <[email protected]> writes:
> > The actual explanation can be seen from comparing a tsvector with a tsquery.
> > To avoid stemming effects, we use the simple configuration below.
>
> > # select plainto_tsquery('simple','boyers-moore');
>
> >             plainto_tsquery
> > -------------------------------------
> >   'boyers-moore' & 'boyers' & 'moore'
>
> > # select to_tsvector('simple','boyers-moore-horspool');
>
> >                           to_tsvector
> > -------------------------------------------------------------
> >   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>
> > Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be
>
> >   'boyers-moore' | ('boyers' & 'moore')
>
> > Maybe it is worth changing the to_tsquery() behavior for such cases.
>
> Changing the behavior of to_tsquery is certainly a lot less scary
> than changing to_tsvector --- it wouldn't call the validity of
> existing tsvector indexes into question.
>
> I see that to_tsquery is even sillier than plainto_tsquery:
>
> regression=# select to_tsquery('simple','boyers-moore');
>                to_tsquery
> -----------------------------------------
>  'boyers-moore' <-> 'boyers' <-> 'moore'
> (1 row)
>
> which is absolutely not a sane translation.
>
> It seems to me that in both cases we'd be better off generating
> "'boyers' <-> 'moore'", without the compound token at all.
> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> but I think if people wanted that they'd just enter separate words.
>
>                         regards, tom lane
>
>





* Re: Mailing list search engine: surprising missing results?
@ 2022-01-25 21:23  Ivan Panchenko <[email protected]>
  parent: James Addison <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Ivan Panchenko @ 2022-01-25 21:23 UTC (permalink / raw)
  To: James Addison <[email protected]>; Tom Lane <[email protected]>; +Cc: [email protected]

On 25.01.2022 23:48, James Addison wrote:
> I'm uncertain why parsing hyphenated query text produces compound tokens?

Because in some cases the user wants to search for the full hyphenated
words, not parts of them.

But the parser is pluggable, so it is possible to develop another one,
such as pg_tsparser [1], which does the same for underscores.

The *to_tsquery functions are also replaceable. Plenty of them can exist,
according to different user requirements.
Such a function just translates the query from the user query language,
with its semantics, into the tsquery language.
So you may write your own and contribute it to the community, or not.
Another option is to write a wrapper function that modifies the result
of an existing *to_tsquery function to fit your task.
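
One possible shape for such a wrapper, sketched in Python on the application side: it rewrites the user's text before it reaches a *to_tsquery function, which is a different variant from modifying the resulting tsquery. It mirrors the workaround suggested earlier in the thread (searching for 'boyer moore' or '"boyer moore"'). The function names are hypothetical, and passing the result on to websearch_to_tsquery is an assumption about how it would be used.

```python
import re


def dehyphenate_query(user_query: str) -> str:
    """Replace hyphens that join two word characters with spaces.

    A leading dash (as in '-term') is left alone, since
    websearch_to_tsquery treats it as a NOT operator.
    """
    return re.sub(r"(?<=\w)-(?=\w)", " ", user_query)


def as_phrase(user_query: str) -> str:
    """Wrap the de-hyphenated text in double quotes for a phrase search."""
    return '"' + dehyphenate_query(user_query) + '"'


print(dehyphenate_query("boyer-moore"))  # boyer moore
print(as_phrase("boyer-moore"))          # "boyer moore"
```

The trade-off is that the compound token is no longer part of the query, so any ranking benefit from an exact compound match would be lost.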

> There are a couple of references[1][2] in the documentation about the
> dash character being converted to a boolean not (!) operator by
> websearch_to_tsquery, but that seems unrelated.
>
> postgres=# select plainto_tsquery('simple', 'a-b');
>    plainto_tsquery
> -------------------
>   'a-b' & 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a_b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a+b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)

In these examples, some characters are removed by the parser. Try
ts_debug('simple', 'a+b').

>
> [1] - https://www.postgresql.org/docs/14/functions-textsearch.html
> [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
> On Tue, 25 Jan 2022 at 17:54, Tom Lane <[email protected]> wrote:
>> Ivan Panchenko <[email protected]> writes:
>>> The actual explanation can be seen from comparing a tsvector with a tsquery.
>>> To avoid stemming effects, we use the simple configuration below.
>>> # select plainto_tsquery('simple','boyers-moore');
>>>              plainto_tsquery
>>> -------------------------------------
>>>    'boyers-moore' & 'boyers' & 'moore'
>>> # select to_tsvector('simple','boyers-moore-horspool');
>>>                            to_tsvector
>>> -------------------------------------------------------------
>>>    'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>>> Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be
>>>    'boyers-moore' | ('boyers' & 'moore')
>>> Maybe it is worth changing the to_tsquery() behavior for such cases.
>> Changing the behavior of to_tsquery is certainly a lot less scary
>> than changing to_tsvector --- it wouldn't call the validity of
>> existing tsvector indexes into question.
>>
>> I see that to_tsquery is even sillier than plainto_tsquery:
>>
>> regression=# select to_tsquery('simple','boyers-moore');
>>                 to_tsquery
>> -----------------------------------------
>>   'boyers-moore' <-> 'boyers' <-> 'moore'
>> (1 row)
>>
>> which is absolutely not a sane translation.
>>
>> It seems to me that in both cases we'd be better off generating
>> "'boyers' <-> 'moore'", without the compound token at all.
>> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
>> but I think if people wanted that they'd just enter separate words.

Matching the compound token might be significant for ranking. (?)

Probably, there is no universal *to_tsquery function and no universal 
parser to fit all users.

[1] https://github.com/postgrespro/pg_tsparser

>>
>>                          regards, tom lane
>>
>>
regards, Ivan

* Re: Mailing list search engine: surprising missing results?
@ 2022-01-26 08:28  James Addison <[email protected]>
  parent: Ivan Panchenko <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: James Addison @ 2022-01-26 08:28 UTC (permalink / raw)
  To: Ivan Panchenko <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

On Tue, 25 Jan 2022 at 21:23, Ivan Panchenko <[email protected]> wrote:
>
> On 25.01.2022 23:48, James Addison wrote:
> > I'm uncertain why parsing hyphenated query text produces compound tokens?
>
> Because in some cases the user wants to search for the full hyphenated words,
> not parts of them.

That makes sense, although to refer back to a previous suggestion of
yours, we could allow matching on the full hyphenated words by
emitting an 'OR' condition from the parsed query, instead of 'AND'
(perhaps using an argument?).

In other words:

# expected query to achieve a match (from your previous post in this thread)
'boyers-moore' | ('boyers' & 'moore')

# actual query that does not result in a match today (plainto_tsquery
for 'boyer-moore')
'boyer-moore' & 'boyer' & 'moore'

> >> It seems to me that in both cases we'd be better off generating
> >> "'boyers' <-> 'moore'", without the compound token at all.
> >> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> >> but I think if people wanted that they'd just enter separate words.
>
> Matching the compound token might be significant for ranking. (?)

Yes, that does seem likely.  The knowledge that there is an exact-match
token in the results could be important for various use cases
(including relevance scoring).

> Probably, there is no universal *to_tsquery function and no universal
> parser to fit all users.

That seems possible too, yep.
