From: Ivan Panchenko <[email protected]>
To: James Addison <[email protected]>
To: Tom Lane <[email protected]>
Cc: [email protected]
Subject: Re: Mailing list search engine: surprising missing results?
Date: Wed, 26 Jan 2022 00:23:35 +0300
Message-ID: <[email protected]> (raw)
In-Reply-To: <CALDQ5NzFfKCDvmbr6otF+ePH=oijN3xBeqjMen4boitUppTMBA@mail.gmail.com>
References: <CALDQ5NxzgeXHRCD4dS_6qz+nn01ivi3i1ZEtD2DmC779i0=iSQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAF4Au4yttKJ1KAP-cO+HMLQ2_66vmx0dLTBUbE4W8Aa64foafg@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CALDQ5NzFfKCDvmbr6otF+ePH=oijN3xBeqjMen4boitUppTMBA@mail.gmail.com>

On 25.01.2022 23:48, James Addison wrote:
> I'm uncertain why parsing hyphenated query text produces compound tokens?

Because in some cases the user wants to search for the full hyphenated 
word, not its parts.

But the parser is pluggable, so it is possible to develop another one, 
such as pg_tsparser [1], which does the same for underscores.

The *to_tsquery functions are also replaceable. There can be plenty of 
them, according to different user requirements.
Such a function just translates the query from the user's query language, 
with its semantics, into the tsquery language.
So you may write your own, and contribute it to the community or not. 
Another option is to write a wrapper function that modifies the result of 
an existing *to_tsquery function to fit your task.
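
As a rough illustration of the wrapper idea, here is a minimal sketch. Note 
that it takes the simpler route of rewriting the input text before calling 
plainto_tsquery, rather than post-processing the resulting tsquery. The 
function name plainto_tsquery_split is hypothetical and not part of 
PostgreSQL, and replacing every hyphen with a space is an assumption that 
may not suit all inputs.

```sql
-- Hypothetical helper; the name and the hyphen-to-space rule are assumptions.
CREATE OR REPLACE FUNCTION plainto_tsquery_split(cfg regconfig, query text)
RETURNS tsquery
LANGUAGE sql
STABLE
AS $$
    -- Replace hyphens with spaces so the parser never sees a hyphenated
    -- word and therefore emits no compound token.
    SELECT plainto_tsquery(cfg, replace(query, '-', ' '));
$$;

-- Should yield only the parts: 'boyers' & 'moore'
SELECT plainto_tsquery_split('simple', 'boyers-moore');

-- The longer hyphenated document now matches, because both parts are
-- present in its tsvector.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ plainto_tsquery_split('simple', 'boyers-moore') AS matches;
```

The trade-off is that the compound token is lost entirely, so an exact 
hyphenated match can no longer be distinguished from the two words 
appearing separately.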

> There are a couple of references[1][2] in the documentation about the
> dash character being converted to a boolean not (!) operator by
> websearch_to_tsquery, but that seems unrelated.
>
> postgres=# select plainto_tsquery('simple', 'a-b');
>    plainto_tsquery
> -------------------
>   'a-b' & 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a_b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a+b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)
In these examples, some characters are removed by the parser. Try 
ts_debug('simple', 'a+b').
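
For example, something like the following shows how the parser classifies 
each piece of the input. This is only a sketch: the column list is a 
subset of what ts_debug returns, and the two inputs are taken from the 
examples quoted above.

```sql
-- Token types and resulting lexemes for the '+' case; the separator is
-- expected to show up as a token that produces no lexeme.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'a+b');

-- Same for the hyphenated case, where the parser should additionally
-- report the whole hyphenated word as a token of its own.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'a-b');
```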
>
> [1] - https://www.postgresql.org/docs/14/functions-textsearch.html
> [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
> On Tue, 25 Jan 2022 at 17:54, Tom Lane <[email protected]> wrote:
>> Ivan Panchenko <[email protected]> writes:
>>> The actual explanation can be seen from comparing a tsvector with a tsquery.
>>> To avoid stemming effects, we use the simple configuration below.
>>> # select plainto_tsquery('simple','boyers-moore');
>>>              plainto_tsquery
>>> -------------------------------------
>>>    'boyers-moore' & 'boyers' & 'moore'
>>> # select to_tsvector('simple','boyers-moore-horspool');
>>>                            to_tsvector
>>> -------------------------------------------------------------
>>>    'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>>> Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be
>>>    'boyers-moore' | ('boyers' & 'moore')
>>> Maybe it is worth changing to_tsquery() behavior for such cases.
>> Changing the behavior of to_tsquery is certainly a lot less scary
>> than changing to_tsvector --- it wouldn't call the validity of
>> existing tsvector indexes into question.
>>
>> I see that to_tsquery is even sillier than plainto_tsquery:
>>
>> regression=# select to_tsquery('simple','boyers-moore');
>>                 to_tsquery
>> -----------------------------------------
>>   'boyers-moore' <-> 'boyers' <-> 'moore'
>> (1 row)
>>
>> which is absolutely not a sane translation.
>>
>> It seems to me that in both cases we'd be better off generating
>> "'boyers' <-> 'moore'", without the compound token at all.
>> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
>> but I think if people wanted that they'd just enter separate words.

Matching the compound token might be significant for ranking (?).
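
One way to check this is to compare matching and ts_rank for the candidate 
translations side by side. The sketch below is only an experiment setup: 
the labels are made up, the two tsquery literals are written by hand 
rather than produced by any existing function, and no particular rank 
values are implied.

```sql
WITH docs(body) AS (
    VALUES ('boyers-moore'::text),
           ('boyers-moore-horspool'::text)
),
queries(label, q) AS (
    VALUES
        -- What plainto_tsquery produces today: compound AND parts.
        ('current', plainto_tsquery('simple', 'boyers-moore')),
        -- Compound token OR both parts, as suggested earlier in the thread.
        ('compound or parts',
         $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery),
        -- Parts only, without the compound token.
        ('parts only', $$'boyers' & 'moore'$$::tsquery)
)
SELECT body,
       label,
       to_tsvector('simple', body) @@ q       AS matches,
       ts_rank(to_tsvector('simple', body), q) AS rank
FROM docs CROSS JOIN queries
ORDER BY body, label;
```

For 'boyers-moore-horspool', the 'current' query should not match (as shown 
in the quoted tsvector above), while the other two should. Comparing the 
rank column for the 'boyers-moore' document shows whether keeping the 
compound token changes the ranking.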

Probably there is no universal *to_tsquery function and no universal 
parser that fits all users.

[1] https://github.com/postgrespro/pg_tsparser

>>
>>                          regards, tom lane
>>
>>
regards, Ivan
  





