Mailing list search engine: surprising missing results?
13+ messages / 6 participants

* Mailing list search engine: surprising missing results?
From: James Addison @ 2022-01-23 12:49 UTC
To: pgsql-www

Hello,

I noticed that the mailing list search engine[1] seems to unexpectedly
miss results for some queries.

For example:

A search for "boyer"[2] returns five results, including result
snippets that contain the text "Boyer-More-Horspool" [sic] and
"Boyer-Moore-Horspool".

However, a more specific search for "boyer-moore"[3] does not return
any results -- that seems surprising.

Specializing the query further and searching for
"boyer-moore-horspool"[4] *does* again return results -- two documents
-- with the terms "boyer" and "horspool" highlighted.

Although it's not a significant problem, I do have a theory that could
explain the behaviour (offered in case it may save time on
investigation):

It seems possible that the term "more" -- and nearby misspellings, like
"moore" -- may be filtered out as stopwords (meaning: they're not
present in the search index), and that the search engine is configured
to require a minimum percentage match rate for query terms.

Under those conditions: searches for "boyer" would produce a 100%
match rate, "boyer-moore" would produce 50% (since "moore" would not be
found in the term index), and "boyer-moore-horspool" would match at
about 66.7% (two of three terms).

Given a required match rate of around two thirds, that could explain
the behaviour (it might not be the true reason, but it seems like one
possibility).

Thanks,
James

[1] https://www.postgresql.org/search/
[2] https://www.postgresql.org/search/?m=1&q=boyer&l=1&d=365&s=r
[3] https://www.postgresql.org/search/?m=1&q=boyer-moore&l=1&d=365&s=r
[4] https://www.postgresql.org/search/?m=1&q=boyer-moore-horspool&l=1&d=365&s=r
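
The stopword part of this theory can be checked directly. The following is only a sketch: it assumes access to a PostgreSQL instance and assumes the built-in `english_stem` dictionary, since the search engine's actual configuration is not stated in the message. `ts_lexize` returns an empty array when the dictionary discards a token as a stop word, and the normalized lexeme otherwise.

```sql
-- Assumption: the stock english_stem Snowball dictionary is the one in use.
-- An empty array ('{}') means the token is dropped as a stop word;
-- a non-empty array holds the stemmed lexeme that would be indexed.
SELECT word,
       ts_lexize('english_stem', word) AS lexemes
FROM (VALUES ('boyer'), ('more'), ('moore'), ('horspool')) AS t(word);
```

This separates "filtered out as a stop word" from "kept in the index, but in stemmed form", which are the two cases the theory needs to tell apart.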

* Re: Mailing list search engine: surprising missing results?
From: Laurenz Albe @ 2022-01-24 07:27 UTC
To: James Addison; pgsql-www

On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> Hello,
>
> I noticed that the mailing list search engine[1] seems to unexpectedly
> miss results for some queries.
>
> For example:
>
> A search for "boyer"[2] returns five results, including result
> snippets that contain the text "Boyer-More-Horspool" [sic] and
> "Boyer-Moore-Horspool".
>
> However, a more specific search for "boyer-moore"[3] does not return
> any results -- that seems surprising.
>
> Specializing the query further and searching for
> "boyer-moore-horspool"[4] *does* again return results -- two documents
> -- with the terms "boyer" and "horspool" highlighted.

This is caused by the peculiarities of PostgreSQL full text search:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
       @@ websearch_to_tsquery('english', 'boyer-moore');

 ?column?
══════════
 f
(1 row)

The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
isn't:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');

                       to_tsvector
══════════════════════════════════════════════════════════
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

SELECT websearch_to_tsquery('english', 'boyer-moore');

        websearch_to_tsquery
═════════════════════════════════════
 'boyer-moor' <-> 'boyer' <-> 'moor'
(1 row)

'boyer-moor' is not present in the first result.

As a workaround, I suggest that you search for 'boyer moore' or (even
better) '"boyer moore"' (with the double quotes):

SELECT websearch_to_tsquery('english', 'boyer moore');

 websearch_to_tsquery
══════════════════════
 'boyer' & 'moor'
(1 row)

SELECT websearch_to_tsquery('english', '"boyer moore"');

 websearch_to_tsquery
══════════════════════
 'boyer' <-> 'moor'
(1 row)

Yours,
Laurenz Albe
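
To confirm that the workaround queries really match the same document text, the match operator can be applied to them directly. This is a minimal sketch assuming the stock `english` configuration; the column aliases are made up for illustration.

```sql
-- Compare both workaround spellings against the hyphenated document text.
SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
       @@ websearch_to_tsquery('english', 'boyer moore')   AS unquoted_matches,
       to_tsvector('english', 'Boyer-Moore-Horspool')
       @@ websearch_to_tsquery('english', '"boyer moore"') AS quoted_matches;
```

Given the tsvector shown in the message ('boyer' at position 2, 'moor' at position 3), both columns should come back true: the lexemes are present, and they are adjacent, which is what the `<->` form additionally requires.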

* Re: Mailing list search engine: surprising missing results?
From: Bruce Momjian @ 2022-01-24 19:28 UTC
To: Laurenz Albe
Cc: James Addison; pgsql-www

On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> > Specializing the query further and searching for
> > "boyer-moore-horspool"[4] *does* again return results -- two documents
> > -- with the terms "boyer" and "horspool" highlighted.
>
> This is caused by the peculiarities of PostgreSQL full text search:
>
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
>        @@ websearch_to_tsquery('english', 'boyer-moore');
>
>  ?column?
> ══════════
>  f
> (1 row)
>
> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> isn't:

Wow, he showed me this problem earlier but I never suspected it was a
stemming issue because I never considered proper nouns could be
stem-adjusted, but it is obvious they can.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

* Re: Mailing list search engine: surprising missing results?
From: Tom Lane @ 2022-01-24 20:47 UTC
To: Bruce Momjian
Cc: Laurenz Albe; James Addison; pgsql-www

Bruce Momjian writes:
> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>> isn't:

> Wow, he showed me this problem earlier but I never suspected it was a
> stemming issue because I never considered proper nouns could be
> stem-adjusted, but it is obvious they can.

I wonder if we should change that so that components of a compound
word are consistently stemmed the same way.

			regards, tom lane

* Re: Mailing list search engine: surprising missing results?
From: Bruce Momjian @ 2022-01-24 22:15 UTC
To: Tom Lane
Cc: Laurenz Albe; James Addison; pgsql-www

On Mon, Jan 24, 2022 at 03:47:29PM -0500, Tom Lane wrote:
> Bruce Momjian writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
>
> > Wow, he showed me this problem earlier but I never suspected it was a
> > stemming issue because I never considered proper nouns could be
> > stem-adjusted, but it is obvious they can.
>
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.

I don't see the value in a change --- it might break the same number of
cases it fixes.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

* Re: Mailing list search engine: surprising missing results?
From: Oleg Bartunov @ 2022-01-25 11:04 UTC
To: Tom Lane
Cc: Bruce Momjian; Laurenz Albe; James Addison; pgsql-www

On Mon, Jan 24, 2022 at 11:47 PM Tom Lane wrote:
> Bruce Momjian writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
>
> > Wow, he showed me this problem earlier but I never suspected it was a
> > stemming issue because I never considered proper nouns could be
> > stem-adjusted, but it is obvious they can.
>
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.

Something like this:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                                          to_tsvector
----------------------------------------------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

* Re: Mailing list search engine: surprising missing results?
From: Laurenz Albe @ 2022-01-25 12:43 UTC
To: Oleg Bartunov; Tom Lane
Cc: Bruce Momjian; James Addison; pgsql-www

On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane wrote:
> > Bruce Momjian writes:
> > > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> > > > The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> > > > is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> > > > isn't:
> >
> > > Wow, he showed me this problem earlier but I never suspected it was a
> > > stemming issue because I never considered proper nouns could be
> > > stem-adjusted, but it is obvious they can.
> >
> > I wonder if we should change that so that components of a compound
> > word are consistently stemmed the same way.
>
> Something like this:
>
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                                           to_tsvector
> ----------------------------------------------------------------------------------------------
>  'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
> (1 row)

Not quite. The problem in question is the "'boyer-moore':1".
If that were "'boyer-moor':1" instead, the problem would disappear.

Yours,
Laurenz Albe

* Re: Mailing list search engine: surprising missing results?
From: Tom Lane @ 2022-01-25 16:22 UTC
To: Laurenz Albe
Cc: Oleg Bartunov; Bruce Momjian; James Addison; pgsql-www

Laurenz Albe writes:
> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane wrote:
>>> Bruce Momjian writes:
>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>> isn't:

> Not quite. The problem in question is the "'boyer-moore':1".
> If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already. So I think the original explanation here is wrong and
we need to probe more closely.

			regards, tom lane
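
One way to probe more closely is `ts_debug`, which lists every token the parser emits together with its token type and the lexemes the dictionaries produce for it. The sketch below assumes the stock `english` configuration and simply runs the two inputs from the message side by side; the `input` label column is added for illustration only.

```sql
-- Show how the parser splits each hyphenated input and what each
-- token is normalized to. "alias" is the parser's token type name.
SELECT 'Boyer-Moore-Horspool' AS input, alias, token, lexemes
FROM ts_debug('english', 'Boyer-Moore-Horspool')
UNION ALL
SELECT 'Boyer-Moore' AS input, alias, token, lexemes
FROM ts_debug('english', 'Boyer-Moore');
```

Comparing the rows for the whole hyphenated token with the rows for its parts makes it visible which lexemes come from the compound word and which come from the individual components.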

* Re: Mailing list search engine: surprising missing results?
From: Ivan Panchenko @ 2022-01-25 17:02 UTC

On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane wrote:
>>>> Bruce Momjian writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite. The problem in question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                        to_tsvector
> ----------------------------------------------------------
>  'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
>             to_tsvector
> -----------------------------------
>  'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
>                         to_tsvector
> -----------------------------------------------------------
>  'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already. So I think the original explanation here is wrong and
> we need to probe more closely.

The actual explanation can be seen by comparing a tsvector with a
tsquery. To avoid stemming effects, we use the simple configuration
below.

# select plainto_tsquery('simple','boyers-moore');
           plainto_tsquery
-------------------------------------
 'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');
                         to_tsvector
-------------------------------------------------------------
 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

Obviously, such a tsvector does not match the above tsquery: the query
requires the lexeme 'boyers-moore', and the tsvector does not contain
it. I think a better tsquery for this query would be

 'boyers-moore' | ('boyers' & 'moore')

Maybe it is worth changing to_tsquery() behavior for such cases.

Regards,
Ivan
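
The difference between the current translation and the proposed one can be checked in a single statement. This is only a sketch: the proposed tsquery is written out by hand as a literal, because no existing function is claimed to produce it, and the column aliases are made up for illustration.

```sql
-- Left: what plainto_tsquery generates today for the hyphenated input.
-- Right: the hand-written alternative suggested in the message.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ plainto_tsquery('simple', 'boyers-moore')
           AS current_translation_matches,
       to_tsvector('simple', 'boyers-moore-horspool')
       @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery
           AS proposed_translation_matches;
```

Going by the tsvector and tsquery outputs quoted in the message, the first column should be false (the compound lexeme 'boyers-moore' is required but absent) and the second true (the right-hand branch of the OR is satisfied by 'boyers' and 'moore').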

* Re: Mailing list search engine: surprising missing results?
From: Tom Lane @ 2022-01-25 17:54 UTC
To: Ivan Panchenko

Ivan Panchenko writes:
> The actual explanation can be seen from comparing a tsvector with a tsquery.
> To avoid stemming effects, we use the simple configuration below.

> # select plainto_tsquery('simple','boyers-moore');
>            plainto_tsquery
> -------------------------------------
>  'boyers-moore' & 'boyers' & 'moore'

> # select to_tsvector('simple','boyers-moore-horspool');
>                          to_tsvector
> -------------------------------------------------------------
>  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

> Obviously, such tsvector does not match the above tsquery. I think a
> better tsquery for this query would be
>  'boyers-moore' | ('boyers' & 'moore')
> Maybe it is worth changing to_tsquery() behavior for such cases.

Changing the behavior of to_tsquery is certainly a lot less scary
than changing to_tsvector --- it wouldn't call the validity of
existing tsvector indexes into question.

I see that to_tsquery is even sillier than plainto_tsquery:

regression=# select to_tsquery('simple','boyers-moore');
               to_tsquery
-----------------------------------------
 'boyers-moore' <-> 'boyers' <-> 'moore'
(1 row)

which is absolutely not a sane translation.

It seems to me that in both cases we'd be better off generating
"'boyers' <-> 'moore'", without the compound token at all.
Maybe there's a case for the weaker 'boyers' & 'moore' translation,
but I think if people wanted that they'd just enter separate words.

			regards, tom lane
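
The phrase-only translation suggested here can already be produced today by feeding the component words, without the hyphen, to `phraseto_tsquery`. The sketch below uses the `simple` configuration from the examples above; it illustrates the shape of the proposed query, and is not a description of any change actually made to `to_tsquery`.

```sql
-- phraseto_tsquery joins the words with <-> (followed-by), and the
-- result contains no compound token.
SELECT phraseto_tsquery('simple', 'boyers moore') AS phrase_query,
       to_tsvector('simple', 'boyers-moore-horspool')
       @@ phraseto_tsquery('simple', 'boyers moore') AS matches;
```

Because the tsvector quoted earlier records 'boyers' at position 2 and 'moore' at position 3, the adjacency requirement of `<->` is met, so the phrase form should match the three-part hyphenated document.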

* Re: Mailing list search engine: surprising missing results?
From: James Addison @ 2022-01-25 20:48 UTC
To: Tom Lane
Cc: Ivan Panchenko

I'm uncertain why parsing hyphenated query text produces compound tokens?

There are a couple of references[1][2] in the documentation about the
dash character being converted to a boolean not (!) operator by
websearch_to_tsquery, but that seems unrelated.

postgres=# select plainto_tsquery('simple', 'a-b');
  plainto_tsquery
-------------------
 'a-b' & 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a_b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a+b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

[1] - https://www.postgresql.org/docs/14/functions-textsearch.html
[2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

On Tue, 25 Jan 2022 at 17:54, Tom Lane wrote:
>
> Changing the behavior of to_tsquery is certainly a lot less scary
> than changing to_tsvector --- it wouldn't call the validity of
> existing tsvector indexes into question.
>
> I see that to_tsquery is even sillier than plainto_tsquery:
>
> regression=# select to_tsquery('simple','boyers-moore');
>                to_tsquery
> -----------------------------------------
>  'boyers-moore' <-> 'boyers' <-> 'moore'
> (1 row)
>
> which is absolutely not a sane translation.
>
> It seems to me that in both cases we'd be better off generating
> "'boyers' <-> 'moore'", without the compound token at all.
> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> but I think if people wanted that they'd just enter separate words.
>
>                         regards, tom lane

* Re: Mailing list search engine: surprising missing results?
From: Ivan Panchenko @ 2022-01-25 21:23 UTC
To: James Addison; Tom Lane

On 25.01.2022 23:48, James Addison wrote:
> I'm uncertain why parsing hyphenated query text produces compound tokens?

Because in some cases the user wants to search for the full hyphenated
words, not parts of them.

But the parser is pluggable; it is possible to develop another one,
such as pg_tsparser [1], which does the same for underscores.

The *to_tsquery functions are also changeable. There can be plenty of
them, according to different user requirements. Such a function just
translates the query from the user's query language, with its
semantics, into the tsquery language. So you may write your own and
contribute it to the community, or not.

Another option is to make a wrapper function which will modify the
result of an existing *to_tsquery function to fit your task.

> There are a couple of references[1][2] in the documentation about the
> dash character being converted to a boolean not (!) operator by
> websearch_to_tsquery, but that seems unrelated.
>
> postgres=# select plainto_tsquery('simple', 'a-b');
>   plainto_tsquery
> -------------------
>  'a-b' & 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a_b');
>  plainto_tsquery
> -----------------
>  'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a+b');
>  plainto_tsquery
> -----------------
>  'a' & 'b'
> (1 row)

In these examples, some characters are removed by the parser. Try
ts_debug('simple', 'a+b').

> [1] - https://www.postgresql.org/docs/14/functions-textsearch.html
> [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
> On Tue, 25 Jan 2022 at 17:54, Tom Lane wrote:
>> Changing the behavior of to_tsquery is certainly a lot less scary
>> than changing to_tsvector --- it wouldn't call the validity of
>> existing tsvector indexes into question.
>>
>> I see that to_tsquery is even sillier than plainto_tsquery:
>>
>> regression=# select to_tsquery('simple','boyers-moore');
>>                to_tsquery
>> -----------------------------------------
>>  'boyers-moore' <-> 'boyers' <-> 'moore'
>> (1 row)
>>
>> which is absolutely not a sane translation.
>>
>> It seems to me that in both cases we'd be better off generating
>> "'boyers' <-> 'moore'", without the compound token at all.
>> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
>> but I think if people wanted that they'd just enter separate words.

Matching the compound token might be significant for ranking. (?)

Probably, there is no universal *to_tsquery function and no universal
parser to fit all users.

[1] https://github.com/postgrespro/pg_tsparser

regards,
Ivan
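
As a rough illustration of the wrapper idea, the sketch below defines a hypothetical function, `hyphen_phrase_tsquery` (the name and behaviour are invented here, not taken from PostgreSQL or from this thread). For simplicity it rewrites the input text before calling an existing function rather than post-processing the resulting tsquery: hyphens become spaces and the words are passed to `phraseto_tsquery`, so no compound token is generated. It treats the whole input as one phrase, which is only reasonable for a single hyphenated term.

```sql
-- Hypothetical helper: translate a single hyphenated term into a
-- phrase query over its parts, e.g. 'boyers-moore' -> 'boyers' <-> 'moore'.
CREATE FUNCTION hyphen_phrase_tsquery(cfg regconfig, q text)
RETURNS tsquery
LANGUAGE sql
STABLE
AS $$
    SELECT phraseto_tsquery(cfg, replace(q, '-', ' '));
$$;

-- Usage: check the hyphenated query against the three-part document.
SELECT hyphen_phrase_tsquery('simple', 'boyers-moore') AS translated,
       to_tsvector('simple', 'boyers-moore-horspool')
       @@ hyphen_phrase_tsquery('simple', 'boyers-moore') AS matches;
```

The trade-off is the one raised in the message: dropping the compound token makes partial matches possible, but it also discards the information that the document contained the exact hyphenated word, which may matter for ranking.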

* Re: Mailing list search engine: surprising missing results?
From: James Addison @ 2022-01-26 08:28 UTC
To: Ivan Panchenko
Cc: Tom Lane

On Tue, 25 Jan 2022 at 21:23, Ivan Panchenko wrote:
>
> On 25.01.2022 23:48, James Addison wrote:
> > I'm uncertain why parsing hyphenated query text produces compound tokens?
>
> Because in some cases user wants to search the full hyphenated words,
> not parts of them.

That makes sense, although to refer back to a previous suggestion of
yours, we could allow matching on the full hyphenated words by
emitting an 'OR' condition from the parsed query, instead of 'AND'
(perhaps using an argument?).

In other words:

# expected query to achieve a match (from your previous post in this thread)
'boyers-moore' | ('boyers' & 'moore')

# actual query that does not result in a match today (plainto_tsquery
# for 'boyer-moore')
'boyer-moore' & 'boyer' & 'moore'

> >> It seems to me that in both cases we'd be better off generating
> >> "'boyers' <-> 'moore'", without the compound token at all.
> >> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> >> but I think if people wanted that they'd just enter separate words.
>
> Matching the compound token might be significant for ranking. (?)

Yes, that does seem likely. The knowledge that there is an exact-match
token in the results could be important for various use cases
(including relevance scoring).

> Probably, there is no universal *to_tsquery function and no universal
> parser to fit all users.

That seems possible too, yep.
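
Whether the compound token actually influences relevance scoring can be tested with `ts_rank`. The following sketch makes several assumptions: it uses the `simple` configuration, two made-up one-line documents, and the hand-written OR query discussed in this thread. No particular scores are claimed here; the point is only to compare them.

```sql
-- Score a document containing the exact hyphenated word against one
-- containing the same words separately, using the OR-style query.
SELECT doc,
       to_tsvector('simple', doc)             AS vector,
       ts_rank(to_tsvector('simple', doc), q) AS score
FROM (VALUES ('boyers-moore'), ('boyers moore')) AS docs(doc),
     (SELECT $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery AS q) AS query
ORDER BY score DESC;
```

If the hyphenated document scores higher, that supports keeping the compound token in the query for ranking purposes even when it is not required for a match.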