Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sjfaq-002oEx-33 for pgsql-general@arkaria.postgresql.org; Thu, 29 Aug 2024 13:53:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sjfao-002jaI-6b for pgsql-general@arkaria.postgresql.org; Thu, 29 Aug 2024 13:53:42 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sjfan-002jYh-K1 for pgsql-general@lists.postgresql.org; Thu, 29 Aug 2024 13:53:42 +0000 Received: from mail-am0eur02olkn2059.outbound.protection.outlook.com ([40.92.49.59] helo=EUR02-AM0-obe.outbound.protection.outlook.com) by magus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sjfaj-0025hs-Vg for pgsql-general@lists.postgresql.org; Thu, 29 Aug 2024 13:53:41 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=vJZ1eIDM+ZVdzYO+WlsebNacRhDFeO9kf8N3xJQ8cVYf+2nkMnl4WVkSTcYqX3M9/DQjN8f+DAB6vH7Dn2WKaczX4XuNY1g1QZdnFyh07ZCJ49jRG03B5v7O34z4+T3om24jTZZwlTgDMmYhMsJ7sPxJ9M3TEvSom2uheQWKW9fn5yRWGxcdQ8ESbcER744r4DYJhVQRobAFFAMj3LhJLncpoKifB8eujLS5EWI+aPaJFY5kTVbLKVuJCQNwVHX/btuO70pwH/IYSyzA+acJXc+wpbMczS0o8SegrPSed7yUx29+LOXugI21pkcLmf9q/LAB/RRLnEE4eIx1DQA9Sg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=tke9wzqBtE5DCTC8s9BRAherYytjbRgnRKWFXsB0mwE=; b=VcTeHZy2u8cjXAw/tDinXuiN8G/fxIxeKd0Bt758lnvnTjopVeIBhtYHxjY/pcbGPpbOWPcSnX1w8NNxP5qSlZw0YW1+JAiQPDUvVJ0+59X6E9Eanh7dDDBSxJHZm81QZXVTC85RuKmFogVs1RFn0oCnbklOGf5dSiDjgBa74rOnfy8G+6vMQOQ58K0Zf1a/6Phrne6zmUmbUbF1e14V7rUED6PQwq0egNzD4bJ/eThlWvSdLryutuHNz8KiuOxSTH131euvKPmsLV89+Qhd4ZqfQJNcKO6/9VcF3+VgS07rdd9rX/9P5Ai6HJyPcBE4IMpJuxMClNSJJewGB+160A== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none; dkim=none; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=tke9wzqBtE5DCTC8s9BRAherYytjbRgnRKWFXsB0mwE=; b=BIO//JqTgybOfBlWv9+Rg8f5F/dNBGZmzwmKPP3luf+vztgQSQY+4JuMbU3+f8BYMejwzNg/E4FavOTjgY6FIG1J0gHc8tnem/8dRGPL4a1WximHyiAy6HIhp9rfHpk/jCdQS6wP1PrRgB1ENSdtpOo6LlBKLVeQa77kd5hXRW6YqljKR0dsmzTIXgvnDYIUgF4/xqXey9SzI6kvvYJm4y4oiXEEl3B9RTEckHoCRzJB8n/+GT/oFg0cqv6jqYi5kcvle13yxPhAjjhyUliEmIOQLd96BLMRz2JwYjtcoyjXQ0yVJJ9hVcuYd4Ck/ZGuztf7hkXCTdhnTi/S/K8h4g== Received: from DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM (2603:10a6:10:3e3::7) by PA2PR10MB9133.EURPRD10.PROD.OUTLOOK.COM (2603:10a6:102:421::15) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7897.27; Thu, 29 Aug 2024 13:53:35 +0000 Received: from DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM ([fe80::b49d:da88:f1cc:953b]) by DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM ([fe80::b49d:da88:f1cc:953b%4]) with mapi id 15.20.7897.027; Thu, 29 Aug 2024 13:53:35 +0000 From: Stanislav Kozlovski To: Tom Lane CC: "pgsql-general@lists.postgresql.org" Subject: Re: tsvector limitations - why and how Thread-Topic: tsvector limitations - why and how Thread-Index: AQHa+L5cSGvGB0Cc2kCtr2rkZIQ6sLI7rmIAgAKUDRs= Date: Thu, 29 Aug 2024 13:53:35 +0000 Message-ID: References: <3150736.1724797492@sss.pgh.pa.us> In-Reply-To: <3150736.1724797492@sss.pgh.pa.us> Accept-Language: en-AU, bg-BG, en-US Content-Language: en-AU X-MS-Has-Attach: X-MS-TNEF-Correlator: msip_labels: x-ms-exchange-messagesentrepresentingtype: 1 x-tmn: [QaCs7xiKGQfU/U1R70VhDozzOuyYjDbxA/Wk/Wa/QyHq/Sdh+8OnoomdE3n3d1C2] x-ms-publictraffictype: Email x-ms-traffictypediagnostic: DU0PR10MB6060:EE_|PA2PR10MB9133:EE_ x-ms-office365-filtering-correlation-id: 14626880-c4a1-402b-b567-08dcc831fa5c x-microsoft-antispam: BCL:0;ARA:14566002|9400799024|15080799006|15030799003|8060799006|19110799003|461199028|12050799009|1602099012|56899033|102099032|3412199025|4302099013|440099028; x-microsoft-antispam-message-info: a7iuuNxm250FfJVaF5br4q3PezVp9AtzyxkIAvu0G2P4ErtdQRjUwVnncm9jHQWOlTxkKQ1FwXPoagNOwkeuq4eSZWLRmYxREbZbmgmTYZebeLanS+fHgL4BiMZ2MHlWwPujDdUsuFrLolgA/lpXXualayDnTJdMu5DlNPYv/1cLv6pDJHbWu9db2Pu6XTfrYI4Oc6ywvwS97P6rq+W51gVVwo6RdiJlfwuhDX/7YHnlX/1P8ekl4aKTgGKZhTl5AWdoEUX0ICnheENylxVgjDiFk+ruvk5dztV2vGbtNKOQDo6gSovOsDptzzM+92x9YFu94mpjV8/of9ibY8gx5/aMQ6dVKefNXwavDn9m1aKXQlAn/yA4O/YVRACV7s74KSM0iSyKylcbSug/kDXkDO0AapZcUUr37O3PxnQ07xxnMuxgoOPJdKhp0DfiwnqzTgn6VAwOKKINTA856gb09fCs02nJXvLCW+9MedQMAMf0AqaxBk2MxOS+pzHkF9+0wM9ICUjZrp71hW7tHQqKFDThiNzcY51n05IWZh0qzvwAKUgi51cSdSLwIhPelb6zrT1e4pLkM86W+p68GUSQ+jvn2ReVuH0Fe2HoByHdBG1o7lo1fslgH50mWT04BCP/rDQMYlhRPBWo5vG7TSwKsohr0AB2/qE5IpsBnIP+/nN+STYplO3qS9giD0B46N4D4iZW+MN+PW76LDMO5izTQrU1EMhbX6xXD3PXZ72/laXms5vsAJG5pxIetvmH6TIx8DS6t9zpHKVRi3LKCcfi6R7pAdpaAm/Aa2T+E00DaLJKvJZ41NAbC4CUlrBXX969uy9Y/XwQpprs1dw3hUlfDieubn+EbE40UzRm3aQt6SgVoPxwuNyAWaavZ/KuTdDhwRSSwg+mgeSCwpJ4K1Yp+FfEOevr16k61/5E0Ah10EnhdnaoZv6peOCxaMHjTaQw x-ms-exchange-antispam-messagedata-chunkcount: 1 x-ms-exchange-antispam-messagedata-0: =?us-ascii?Q?UjUz9/OWZuWY8+XvhA7VLlGHoJE3yixsl3QonDi5b2GjmmhosgYQENUhh5KI?= =?us-ascii?Q?FYUV43fjbKp4LGIXDIrsfhgQYGqXDs0LHKjb74l4GKYk1Eilcq5b3mqlS/CU?= =?us-ascii?Q?rwBYqQt7ExxRfrHpa6xZRccYEBQtHzUHVF3h0dndNtsZAnelyX0pUxgWfOxI?= =?us-ascii?Q?hrZ/B3kDSivuO0dBNpV7dZO3C07aB0cySv8SPrdc2QONxf2fcdFlgM5cbI56?= =?us-ascii?Q?dyBcLvbMZ9Yp/jYO02YigV80pLQa8oy7LuccRoZ2XL3WCDJ32/PCQpsIskim?= =?us-ascii?Q?HeIMt2a6s7eWVPu/nPGnNk03yiQrJ53KsnLeCkySp8rHT8mQ48IZzermbLbq?= =?us-ascii?Q?oLvCXsUkDnyCAJpAdCWK/79J7VUG80tBnsIKgQ3z9gMvvpvETtO+n21gFoMQ?= =?us-ascii?Q?AdeNeA2jbEn+tcQ5Cqhwt0x1S4GlCt1WNLG1hvtY5VCkrb+iyt7ltwaOkP/7?= =?us-ascii?Q?4Ncr2OJhYUC4g2EnP9FaaNfCBRGMB06kysxqNUh054vCGImT1FFkY7dwSsH5?= =?us-ascii?Q?e6z4g0Q/a4pwBxXay8xMg2CVY5mxP9V8IKIYMC7pK4aKWtr3DHX/co9Wuf4P?= =?us-ascii?Q?10BIA05hzZXDi3gi05QOUPIMZvKk//MYQeOVvfmggc8PbxDN+lmGHxAnxxli?= =?us-ascii?Q?vAAo/P/l5FhuR2PhbdTHPTuPdccZX+9pbFc98BwhGoFJz5EQZ4+rKIPZqw18?= =?us-ascii?Q?TQmTNJ8s1pVfdHGeqoN0ipF9KiWdXgxmAAavnnxEoeVFUeQRHl45Wv7Y85Mf?= =?us-ascii?Q?DYVBWblJCROS0lmAKNWPT2n/U1XSHw9KOqlV6ALqgwYXgV+ROQ9wAWnNxebW?= =?us-ascii?Q?My2e71x0fy/YtzYa67bXIh50+J/tKTfEA6Q/Dg+16HxgX/bRNz4fJvBCA3OT?= =?us-ascii?Q?NmWvFoyQ/vKV2UFlNBMCXrv/DIzwsjnq3yvBe1xNfge/UxoOJuG7OLvFq6I0?= =?us-ascii?Q?Sp+0w5aD7Sr1DkZtN8Lswfdw+Lqv4GJLqcjZgNe7pyG5R7YPVeW8j4b7hlNQ?= =?us-ascii?Q?eK1cPd6Qj3w/pIVBH2vVVsxM6A0YVAN4980b6MPvOiWUbvcsyNjR7fVjFQQT?= =?us-ascii?Q?38/IdjxG7/Xc93tiVM6aw22xvRlce0C0XbHZSALYGzZpRrph/azMcb0k8AQo?= =?us-ascii?Q?VQWElG1bYwOaS4dSGll53Yhg6aMWgY9RS85knchSMTZHuK46y0UR+ozRur40?= =?us-ascii?Q?5L0MaMCd2UAtr/o0Zn4619x6qGBPP+C2gB6Z0k4oF8S6GNcqfBVT9w5FIk0Z?= =?us-ascii?Q?gIhzVKjzgMrSngxDLezYiGBAHToqgf3INHu0u3akBwTQWzKP9Nb2MUZVuLfG?= =?us-ascii?Q?MJtaWEJwG6zTC0g/IEA9Hd//?= Content-Type: multipart/alternative; boundary="_000_DU0PR10MB6060BAFEB301D093D6B4F57D8A962DU0PR10MB6060EURP_" MIME-Version: 1.0 X-OriginatorOrg: outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-AuthSource: DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-Network-Message-Id: 14626880-c4a1-402b-b567-08dcc831fa5c X-MS-Exchange-CrossTenant-originalarrivaltime: 29 Aug 2024 13:53:35.8697 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-Transport-CrossTenantHeadersStamped: PA2PR10MB9133 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --_000_DU0PR10MB6060BAFEB301D093D6B4F57D8A962DU0PR10MB6060EURP_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Thanks for the reply, Tom. Makes sense to me. Worth saying - one very large misunderstanding that was pointed out to me w= as that the position limit is not per character - it is per word. This make= s sense given the parser parses per word - but I completely missed it. It b= asically completely changes my calculations: > If I want to store a whole books' content - like PostgreSQL: Up and Runni= ng (2012) - I'd need to store it over 30 rows. (it's 300 pages long, 300-pa= ge books average about 82500 words, English words average about 6.5-4 chara= cters, meaning a tsvector will hold the positions of no more than [2520-327= 7] words). to having to store 8250 words per chapter, meaning just 10 rows as each cha= pter would cleanly fit into a tsvector. (assuming the max position per lexe= me isn't hit) Meaning the limitations aren't as egregious as I first thought. This furthe= r explains to me why there hasn't been much interest by others to expand th= e limits. ________________________________ From: Tom Lane Sent: Wednesday, 28 August 2024 0:24 To: Stanislav Kozlovski Cc: pgsql-general@lists.postgresql.org Subject: Re: tsvector limitations - why and how Stanislav Kozlovski writes: > I was aware of the limitations of FTS and tried to ensure I didn't hit any - but wha= t I missed was that the maximum allowed lexeme position was 16383 and every= thing above silently gets set to 16383. I was searching for a phrase (two w= ords) at the end of the book and couldn't find it. After debugging I realiz= ed that my phrase's lexemes were being set to 16383, which was inaccurate. > ... > The problem I had is that it breaks FOLLOWED BY queries, essentially stop= ping you from being able to match on phrases (more than one word) on large = text. Yeah. FOLLOWED BY didn't exist when the tsvector storage representation was designed, so the possible inaccuracy of the lexeme positions wasn't such a big deal. > Why is this still the case? Because nobody's done the significant amount of work needed to make it better. I think an acceptable patch would have to support both the current tsvector representation and a "big" version that's able to handle anything up to the 1GB varlena limit. (If you were hoping for documents bigger than that, you'd be needing a couple more orders of magnitude worth of work.) We might also find that there are performance bottlenecks that'd have to be improved, but even just making the code cope with two representations would be a big patch. There has been some cursory talk about this, I think, but I don't believe anyone's actually worked on it since the 2017 patch you mentioned. I'm not sure if that patch is worth using as the basis for a fresh try: it looks like it had some performance issues, and AFAICS it didn't really improve the lexeme-position limit. (Wanders away wondering if the expanded-datum infrastructure could be exploited here...) regards, tom lane --_000_DU0PR10MB6060BAFEB301D093D6B4F57D8A962DU0PR10MB6060EURP_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
Thanks for the reply, Tom.

Makes sense to me.

Worth saying - one very large misunderstanding that was pointed out to me w= as that the position limit is not per character - it is per word. This make= s sense given the parser parses per word - but I completely missed it. It b= asically completely changes my calculations:

> If I want to store a whole books' content - like PostgreSQL: Up and Ru= nning (2012) - I'd need to store it over 30 rows. (it's 300 pages long, 300= -page books average about 82500 words, English words average about 6.5-4 ch= aracters, meaning a tsvector will hold the positions of no more than [2520-3277] words).

to having to store 8250 words per chapter, meaning just 10 rows as each cha= pter would cleanly fit into a tsvector. (assuming the max position per lexe= me isn't hit)

Meaning the limitations aren't as egregious as I first thought. This furthe= r explains to me why there hasn't been much interest by others to expand th= e limits.



From: Tom Lane <tgl@sss.= pgh.pa.us>
Sent: Wednesday, 28 August 2024 0:24
To: Stanislav Kozlovski <Stanislav_Kozlovski@outlook.com>
Cc: pgsql-general@lists.postgresql.org <pgsql-general@lists.postg= resql.org>
Subject: Re: tsvector limitations - why and how
 
Stanislav Kozlovski <Stanislav_Kozlovski@outloo= k.com> writes:
> I was aware of the limitations of FTS<https://www.postgresql.org/do= cs/17/textsearch-limitations.html> and tried to ensure I didn't hit = any - but what I missed was that the maximum allowed lexeme position was 16383 and everything above silently gets set t= o 16383. I was searching for a phrase (two words) at the end of the book an= d couldn't find it. After debugging I realized that my phrase's lexemes wer= e being set to 16383, which was inaccurate.
> ...
> The problem I had is that it breaks FOLLOWED BY queries, essentially s= topping you from being able to match on phrases (more than one word) on lar= ge text.

Yeah.  FOLLOWED BY didn't exist when the tsvector storage
representation was designed, so the possible inaccuracy of the
lexeme positions wasn't such a big deal.

> Why is this still the case?

Because nobody's done the significant amount of work needed to make
it better.  I think an acceptable patch would have to support both
the current tsvector representation and a "big" version that's ab= le
to handle anything up to the 1GB varlena limit.  (If you were hoping for documents bigger than that, you'd be needing a couple more
orders of magnitude worth of work.)  We might also find that there
are performance bottlenecks that'd have to be improved, but even just
making the code cope with two representations would be a big patch.

There has been some cursory talk about this, I think, but I don't
believe anyone's actually worked on it since the 2017 patch you
mentioned.  I'm not sure if that patch is worth using as the basis
for a fresh try: it looks like it had some performance issues, and
AFAICS it didn't really improve the lexeme-position limit.

(Wanders away wondering if the expanded-datum infrastructure could
be exploited here...)

            &nb= sp;           regards, to= m lane
--_000_DU0PR10MB6060BAFEB301D093D6B4F57D8A962DU0PR10MB6060EURP_--