From: Tom Lane
To: Ivan Panchenko
cc: pgsql-www@lists.postgresql.org
Subject: Re: Mailing list search engine: surprising missing results?
In-reply-to: <79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru>
References: <2150096.1643057249@sss.pgh.pa.us> <22d5245c9c5a9aa05a0510bdd52458812140a870.camel@cybertec.at> <2257661.1643127753@sss.pgh.pa.us> <79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru>
Date: Tue, 25 Jan 2022 12:54:28 -0500
Message-ID: <2274255.1643133268@sss.pgh.pa.us>

Ivan Panchenko writes:
> The actual explanation can be seen from comparing a tsvector with a tsquery.
> To avoid stemming effects, we use the simple configuration below.
> # select plainto_tsquery('simple','boyers-moore');
>            plainto_tsquery
> -------------------------------------
>  'boyers-moore' & 'boyers' & 'moore'
> # select to_tsvector('simple','boyers-moore-horspool');
>                          to_tsvector
> -------------------------------------------------------------
>  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
> Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be
>  'boyers-moore' | ('boyers' & 'moore')
> May be, it is worth changing to_tsquery() behavior for such cases.

Changing the behavior of to_tsquery is certainly a lot less scary than
changing to_tsvector --- it wouldn't call the validity of existing
tsvector indexes into question.

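To make the mismatch concrete, a check along the following lines could be run. This is only a sketch: it casts hand-written literals, copied from the output quoted above, to tsvector and tsquery instead of calling to_tsvector() or plainto_tsquery(), so the result should not depend on how a particular server version parses hyphenated words. The column aliases are made up for illustration.

```sql
-- tsvector literal copied from the quoted to_tsvector('simple', ...) output;
-- being a literal, it bypasses the parser and any text search configuration.
WITH doc(v) AS (
    SELECT $$'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3$$::tsvector
)
SELECT
    -- Query as reported for plainto_tsquery('simple','boyers-moore'):
    -- it requires a lexeme 'boyers-moore', which the tsvector does not contain.
    v @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery   AS reported_form,
    -- Ivan's suggested form: the compound token is only one alternative.
    v @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery AS suggested_form
FROM doc;
```

Given the semantics of & and |, reported_form should come out false and suggested_form true, which is the behavior described in the quoted message.
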
I see that to_tsquery is even sillier than plainto_tsquery:

regression=# select to_tsquery('simple','boyers-moore');
               to_tsquery
-----------------------------------------
 'boyers-moore' <-> 'boyers' <-> 'moore'
(1 row)

which is absolutely not a sane translation.

It seems to me that in both cases we'd be better off generating
"'boyers' <-> 'moore'", without the compound token at all.
Maybe there's a case for the weaker 'boyers' & 'moore' translation,
but I think if people wanted that they'd just enter separate words.

			regards, tom lane
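
For anyone wanting to compare the three candidate translations side by side, a sketch like the one below may help. It again uses literal casts rather than to_tsquery(), so it illustrates the matching semantics only, not what any given release generates. The second row, its label, and the column aliases are hypothetical, added to show where the phrase form and the weaker & form differ.

```sql
WITH docs(label, v) AS (
    VALUES
      -- tsvector from the quoted example
      ('boyers-moore-horspool',
       $$'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3$$::tsvector),
      -- hypothetical document with both words present but far apart
      ('moore ... boyers',
       $$'boyers':5 'moore':1$$::tsvector)
)
SELECT
    label,
    -- form shown above for to_tsquery('simple','boyers-moore')
    v @@ $$'boyers-moore' <-> 'boyers' <-> 'moore'$$::tsquery AS compound_phrase,
    -- proposed form: the parts as a phrase, no compound token
    v @@ $$'boyers' <-> 'moore'$$::tsquery                    AS parts_phrase,
    -- weaker form: both parts anywhere in the document
    v @@ $$'boyers' & 'moore'$$::tsquery                      AS parts_and
FROM docs;
```

The compound_phrase column cannot match either row, since neither tsvector contains a 'boyers-moore' lexeme. The parts_phrase column matches only the first row, where 'moore' directly follows 'boyers', while parts_and matches both.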