From: Tom Lane
To: Laurenz Albe
cc: Oleg Bartunov, Bruce Momjian, James Addison, PostgreSQL WWW
Subject: Re: Mailing list search engine: surprising missing results?
In-reply-to: <22d5245c9c5a9aa05a0510bdd52458812140a870.camel@cybertec.at>
Date: Tue, 25 Jan 2022 11:22:33 -0500
Message-ID: <2257661.1643127753@sss.pgh.pa.us>

Laurenz Albe writes:
> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane wrote:
>>> Bruce Momjian writes:
>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>> isn't:

> Not quite.  The problem in question is the "'boyer-moore':1".
> If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is
stemmable, say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.

			regards, tom lane