From: Oleg Bartunov
Date: Tue, 25 Jan 2022 14:04:09 +0300
Subject: Re: Mailing list search engine: surprising missing results?
To: Tom Lane
Cc: Bruce Momjian, Laurenz Albe, James Addison, PostgreSQL WWW
In-Reply-To: <2150096.1643057249@sss.pgh.pa.us>

On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Bruce Momjian <bruce@momjian.us> writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
>
> > Wow, he showed me this problem earlier but I never suspected it was
> > a stemming issue because I never considered proper nouns could be
> > stem-adjusted, but it is obvious they can.
>
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.

Something like this:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)

>
>                         regards, tom lane

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
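
The expansion suggested in the example output can be sketched roughly as follows. This is only an illustration in Python, not how the PostgreSQL text search parser works internally. The names `compound_lexemes` and `toy_stem` are hypothetical, and `toy_stem` is a crude stand-in for a real English stemmer. The sketch assumes that every contiguous run of two or more parts is emitted unstemmed at position 1, while each single part is stemmed and numbered from position 2, which is how the example output reads.

```python
from typing import Callable, Dict, List


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: drops one trailing 'e'."""
    return token[:-1] if len(token) > 3 and token.endswith("e") else token


def compound_lexemes(
    word: str, stem: Callable[[str], str] = toy_stem
) -> Dict[str, List[int]]:
    """Map each lexeme of a hyphenated word to its list of positions."""
    parts = word.lower().split("-")
    lexemes: Dict[str, List[int]] = {}

    # Assumption: every contiguous run of two or more parts is kept
    # unstemmed and shares position 1 with the full compound.
    for start in range(len(parts)):
        for end in range(start + 2, len(parts) + 1):
            lexemes.setdefault("-".join(parts[start:end]), []).append(1)

    # Each single part is stemmed; parts of a compound start at position 2.
    offset = 2 if len(parts) > 1 else 1
    for index, part in enumerate(parts):
        lexemes.setdefault(stem(part), []).append(index + offset)

    return lexemes


if __name__ == "__main__":
    document = compound_lexemes("Boyer-Moore-Horspool")
    query = compound_lexemes("boyer-moore")
    print(document)
    # Every lexeme of the shorter compound also occurs in the longer one.
    print(set(query) <= set(document))
```

With this kind of expansion, a search for 'boyer-moore' would produce only lexemes that also appear in a document containing 'Boyer-Moore-Horspool', so the shorter compound would match the longer one.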