Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj4cU-00E1Ww-Da for pgsql-general@arkaria.postgresql.org; Tue, 27 Aug 2024 22:24:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sj4cS-00Da4O-Gm for pgsql-general@arkaria.postgresql.org; Tue, 27 Aug 2024 22:24:57 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj4cS-00Da4C-4o for pgsql-general@lists.postgresql.org; Tue, 27 Aug 2024 22:24:56 +0000 Received: from sss.pgh.pa.us ([68.162.161.243]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj4cO-001pEg-Ok for pgsql-general@lists.postgresql.org; Tue, 27 Aug 2024 22:24:56 +0000 Received: from sss1.sss.pgh.pa.us (localhost [127.0.0.1]) by sss.pgh.pa.us (8.15.2/8.15.2) with ESMTP id 47RMOqfq3150737; Tue, 27 Aug 2024 18:24:52 -0400 From: Tom Lane To: Stanislav Kozlovski cc: "pgsql-general@lists.postgresql.org" Subject: Re: tsvector limitations - why and how In-reply-to: References: Comments: In-reply-to Stanislav Kozlovski message dated "Tue, 27 Aug 2024 20:38:03 -0000" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <3150735.1724797492.1@sss.pgh.pa.us> Content-Transfer-Encoding: quoted-printable Date: Tue, 27 Aug 2024 18:24:52 -0400 Message-ID: <3150736.1724797492@sss.pgh.pa.us> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Stanislav Kozlovski writes: > I was aware of the limitations of FTS and tried to ensure I didn't hit any - but w= hat I missed was that the maximum allowed lexeme position was 16383 and ev= erything above silently gets set to 16383. I was searching for a phrase (t= wo words) at the end of the book and couldn't find it. After debugging I r= ealized that my phrase's lexemes were being set to 16383, which was inaccu= rate. > ... > The problem I had is that it breaks FOLLOWED BY queries, essentially sto= pping you from being able to match on phrases (more than one word) on larg= e text. Yeah. FOLLOWED BY didn't exist when the tsvector storage representation was designed, so the possible inaccuracy of the lexeme positions wasn't such a big deal. > Why is this still the case? Because nobody's done the significant amount of work needed to make it better. I think an acceptable patch would have to support both the current tsvector representation and a "big" version that's able to handle anything up to the 1GB varlena limit. (If you were hoping for documents bigger than that, you'd be needing a couple more orders of magnitude worth of work.) We might also find that there are performance bottlenecks that'd have to be improved, but even just making the code cope with two representations would be a big patch. There has been some cursory talk about this, I think, but I don't believe anyone's actually worked on it since the 2017 patch you mentioned. I'm not sure if that patch is worth using as the basis for a fresh try: it looks like it had some performance issues, and AFAICS it didn't really improve the lexeme-position limit. (Wanders away wondering if the expanded-datum infrastructure could be exploited here...) regards, tom lane