From: Tom Lane <[email protected]>
To: Stanislav Kozlovski <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: tsvector limitations - why and how
Date: Tue, 27 Aug 2024 18:24:52 -0400
Message-ID: <[email protected]>
In-Reply-To: <DU0PR10MB60604534BEBB91C67FD743A78A942@DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM>
References: <DU0PR10MB60604534BEBB91C67FD743A78A942@DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM>
Stanislav Kozlovski <[email protected]> writes:
> I was aware of the limitations of FTS <https://www.postgresql.org/docs/17/textsearch-limitations.html> and tried to ensure I didn't hit any - but what I missed was that the maximum allowed lexeme position was 16383 and everything above silently gets set to 16383. I was searching for a phrase (two words) at the end of the book and couldn't find it. After debugging I realized that my phrase's lexemes were being set to 16383, which was inaccurate.
> ...
> The problem I had is that it breaks FOLLOWED BY queries, essentially stopping you from being able to match on phrases (more than one word) on large text.
Yeah. FOLLOWED BY didn't exist when the tsvector storage
representation was designed, so the possible inaccuracy of the
lexeme positions wasn't such a big deal.
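
To make the failure mode concrete, here is a toy Python model of the
behavior described above. It is only a sketch, not the tsvector code:
the helper names (clamp, build_positions, followed_by) are made up for
illustration, and it ignores stemming, stop words, and the cap on the
number of positions stored per lexeme. The only facts it relies on are
that stored positions top out at 16383 and that a FOLLOWED BY match
needs the second lexeme's position to be exactly one more than the
first's.

```python
MAX_POS = 16383  # documented upper bound for a stored lexeme position


def clamp(pos: int) -> int:
    """Positions beyond the limit are silently stored as MAX_POS."""
    return min(pos, MAX_POS)


def build_positions(words):
    """Map each word to its stored (clamped) 1-based positions.

    Hypothetical stand-in for building a tsvector; no normalization is done.
    """
    index = {}
    for i, word in enumerate(words, start=1):
        index.setdefault(word, []).append(clamp(i))
    return index


def followed_by(index, first, second):
    """True if some stored position of `second` is exactly one past `first`."""
    second_positions = set(index.get(second, []))
    return any(p + 1 in second_positions for p in index.get(first, []))


# The phrase near the start of a document keeps accurate positions.
early = build_positions(["quick", "fox"])

# The same phrase after 20000 words: both positions collapse to 16383.
late = build_positions(["filler"] * 20000 + ["quick", "fox"])

print(followed_by(early, "quick", "fox"))  # phrase is found
print(followed_by(late, "quick", "fox"))   # phrase is missed
print(late["quick"], late["fox"])          # both stored as [16383]
```

In the second document both lexemes end up at the same stored position,
so the distance between them is 0 rather than 1 and the phrase test
cannot succeed, even though a plain AND of the two lexemes still would.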
> Why is this still the case?
Because nobody's done the significant amount of work needed to make
it better. I think an acceptable patch would have to support both
the current tsvector representation and a "big" version that's able
to handle anything up to the 1GB varlena limit. (If you were hoping
for documents bigger than that, you'd be needing a couple more
orders of magnitude worth of work.) We might also find that there
are performance bottlenecks that'd have to be improved, but even just
making the code cope with two representations would be a big patch.
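
As a rough picture of what "two representations" means at the level of
a single position entry, the following Python sketch packs a 2-bit
weight and a position into one integer. The 14-bit position width
matches the 16383 limit discussed in this thread; the 30-bit "big"
width is purely hypothetical and chosen only for illustration, as are
the function names. Nothing here reflects an actual patch.

```python
SMALL_POS_BITS = 14  # position width implied by the 16383 limit
BIG_POS_BITS = 30    # hypothetical wider width, for illustration only


def pack(weight: int, pos: int, pos_bits: int = SMALL_POS_BITS) -> int:
    """Pack a 2-bit weight above a pos_bits-wide position, clamping the position."""
    if not 0 <= weight <= 3:
        raise ValueError("weight must fit in 2 bits")
    max_pos = (1 << pos_bits) - 1
    return (weight << pos_bits) | min(pos, max_pos)


def unpack(entry: int, pos_bits: int = SMALL_POS_BITS):
    """Return (weight, position) from a packed entry."""
    return entry >> pos_bits, entry & ((1 << pos_bits) - 1)


# With 14 position bits the entry fits in 16 bits, but position 20001 is lost.
small = pack(3, 20001)
print(small < (1 << 16), unpack(small))

# With the hypothetical wider layout the same position survives intact.
big = pack(3, 20001, BIG_POS_BITS)
print(unpack(big, BIG_POS_BITS))
```

Every routine that reads or writes positions would need to know which
width it is dealing with, which hints at why coping with both layouts
is a large change even before any performance work.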
There has been some cursory talk about this, I think, but I don't
believe anyone's actually worked on it since the 2017 patch you
mentioned. I'm not sure if that patch is worth using as the basis
for a fresh try: it looks like it had some performance issues, and
AFAICS it didn't really improve the lexeme-position limit.
(Wanders away wondering if the expanded-datum infrastructure could
be exploited here...)
regards, tom lane