public inbox for [email protected]  
From: James Addison <[email protected]>
To: Tom Lane <[email protected]>
Cc: Ivan Panchenko <[email protected]>
Cc: [email protected]
Subject: Re: Mailing list search engine: surprising missing results?
Date: Tue, 25 Jan 2022 20:48:34 +0000
Message-ID: <CALDQ5NzFfKCDvmbr6otF+ePH=oijN3xBeqjMen4boitUppTMBA@mail.gmail.com>
In-Reply-To: <[email protected]>
References: <CALDQ5NxzgeXHRCD4dS_6qz+nn01ivi3i1ZEtD2DmC779i0=iSQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAF4Au4yttKJ1KAP-cO+HMLQ2_66vmx0dLTBUbE4W8Aa64foafg@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

I'm not sure why parsing hyphenated query text produces compound tokens.

There are a couple of references[1][2] in the documentation about the
dash character being converted to a boolean not (!) operator by
websearch_to_tsquery, but that seems unrelated.

postgres=# select plainto_tsquery('simple', 'a-b');
  plainto_tsquery
-------------------
 'a-b' & 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a_b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a+b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)
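
A minimal sketch for digging further, assuming a server that behaves as
in the outputs quoted in this thread. ts_debug() shows which token types
the parser emits for each input, and the @@ checks compare the tsvector
from Ivan's example against hand-written tsquery literals (the literals
are illustrative; they are not what any current function generates).

```sql
-- Which tokens does the parser emit for each separator?
SELECT alias, token, lexemes FROM ts_debug('simple', 'a-b');
SELECT alias, token, lexemes FROM ts_debug('simple', 'a_b');
SELECT alias, token, lexemes FROM ts_debug('simple', 'a+b');

-- Match checks against the tsvector quoted below:
--   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

-- Current plainto_tsquery shape: should not match, because the
-- lexeme 'boyers-moore' is absent from the tsvector.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery;

-- Ivan's suggested shape: should match via the right-hand branch.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery;

-- Tom's suggested shape: should match, since 'boyers' and 'moore'
-- sit at adjacent positions 2 and 3.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers' <-> 'moore'$$::tsquery;
```

Casting a quoted literal to tsquery bypasses the parser, so these checks
should not depend on how the *_tsquery functions treat the hyphen.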

[1] - https://www.postgresql.org/docs/14/functions-textsearch.html
[2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

On Tue, 25 Jan 2022 at 17:54, Tom Lane <[email protected]> wrote:
>
> Ivan Panchenko <[email protected]> writes:
> > The actual explanation can be seen from comparing a tsvector with a tsquery.
> > To avoid stemming effects, we use the simple configuration below.
>
> > # select plainto_tsquery('simple','boyers-moore');
>
> >             plainto_tsquery
> > -------------------------------------
> >   'boyers-moore' & 'boyers' & 'moore'
>
> > # select to_tsvector('simple','boyers-moore-horspool');
>
> >                           to_tsvector
> > -------------------------------------------------------------
> >   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>
> > Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be
>
> >   'boyers-moore' | ('boyers' & 'moore')
>
> > Maybe it is worth changing to_tsquery() behavior for such cases.
>
> Changing the behavior of to_tsquery is certainly a lot less scary
> than changing to_tsvector --- it wouldn't call the validity of
> existing tsvector indexes into question.
>
> I see that to_tsquery is even sillier than plainto_tsquery:
>
> regression=# select to_tsquery('simple','boyers-moore');
>                to_tsquery
> -----------------------------------------
>  'boyers-moore' <-> 'boyers' <-> 'moore'
> (1 row)
>
> which is absolutely not a sane translation.
>
> It seems to me that in both cases we'd be better off generating
> "'boyers' <-> 'moore'", without the compound token at all.
> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> but I think if people wanted that they'd just enter separate words.
>
>                         regards, tom lane
>
>




