From: James Addison
Date: Tue, 25 Jan 2022 20:48:34 +0000
Subject: Re: Mailing list search engine: surprising missing results?
To: Tom Lane
Cc: Ivan Panchenko, pgsql-www@lists.postgresql.org
In-Reply-To: <2274255.1643133268@sss.pgh.pa.us>

I'm uncertain why parsing hyphenated query text produces compound tokens.
There are a couple of references[1][2] in the documentation about the dash
character being converted to a boolean NOT (!) operator by
websearch_to_tsquery, but that seems unrelated.

postgres=# select plainto_tsquery('simple', 'a-b');
  plainto_tsquery
-------------------
 'a-b' & 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a_b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a+b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

[1] - https://www.postgresql.org/docs/14/functions-textsearch.html
[2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

On Tue, 25 Jan 2022 at 17:54, Tom Lane wrote:
>
> Ivan Panchenko writes:
> > The actual explanation can be seen from comparing a tsvector with a tsquery.
> > To avoid stemming effects, we use the simple configuration below.
> >
> > # select plainto_tsquery('simple','boyers-moore');
> >
> >            plainto_tsquery
> > -------------------------------------
> >  'boyers-moore' & 'boyers' & 'moore'
> >
> > # select to_tsvector('simple','boyers-moore-horspool');
> >
> >                          to_tsvector
> > -------------------------------------------------------------
> >  'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
> >
> > Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be
> >
> > 'boyers-moore' | ('boyers' & 'moore')
> >
> > May be, it is worth changing to_tsquery() behavior for such cases.
>
> Changing the behavior of to_tsquery is certainly a lot less scary
> than changing to_tsvector --- it wouldn't call the validity of
> existing tsvector indexes into question.
>
> I see that to_tsquery is even sillier than plainto_tsquery:
>
> regression=# select to_tsquery('simple','boyers-moore');
>                to_tsquery
> -----------------------------------------
>  'boyers-moore' <-> 'boyers' <-> 'moore'
> (1 row)
>
> which is absolutely not a sane translation.
>
> It seems to me that in both cases we'd be better off generating
> "'boyers' <-> 'moore'", without the compound token at all.
> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> but I think if people wanted that they'd just enter separate words.
>
> regards, tom lane
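
To make the comparison concrete, here is a minimal sketch that checks each
candidate translation against the tsvector Ivan quoted. Both the tsvector and
the tsqueries are written as literals copied from the outputs in this thread,
so the text search parser is not involved and only the @@ matching is
exercised. The CTE name and the column aliases are my own, chosen purely for
illustration.

```sql
-- tsvector literal copied from the to_tsvector('simple', ...) output quoted above
WITH doc(v) AS (
    SELECT $$'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3$$::tsvector
)
SELECT
    -- what plainto_tsquery produces today: requires the absent lexeme 'boyers-moore'
    v @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery     AS plainto_today,
    -- what to_tsquery produces today: same absent lexeme, now in a phrase
    v @@ $$'boyers-moore' <-> 'boyers' <-> 'moore'$$::tsquery AS to_tsquery_today,
    -- Ivan's suggestion: compound token OR both parts
    v @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery   AS compound_or_parts,
    -- Tom's suggestion: adjacent parts, no compound token
    v @@ $$'boyers' <-> 'moore'$$::tsquery                    AS phrase_only,
    -- the weaker translation Tom mentions
    v @@ $$'boyers' & 'moore'$$::tsquery                      AS parts_and
FROM doc;
```

Since 'boyers-moore' is not a lexeme in that tsvector, the first two columns
should come back false, while the remaining three should be true ('boyers' and
'moore' sit at adjacent positions 2 and 3, which satisfies the <-> operator).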