Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj2xB-00DPSD-45 for pgsql-general@arkaria.postgresql.org; Tue, 27 Aug 2024 20:38:13 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sj2x8-00CTZh-3l for pgsql-general@arkaria.postgresql.org; Tue, 27 Aug 2024 20:38:10 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj2x7-00CTYa-DX for pgsql-general@lists.postgresql.org; Tue, 27 Aug 2024 20:38:10 +0000 Received: from mail-am6eur05olkn2105.outbound.protection.outlook.com ([40.92.91.105] helo=EUR05-AM6-obe.outbound.protection.outlook.com) by makus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sj2x4-001jtW-DV for pgsql-general@lists.postgresql.org; Tue, 27 Aug 2024 20:38:08 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=AgcEe1k9B0hdVkPCleeDiYaiVe0/DOIQMfbSx6dJPWWXQYo1mQICEs1+xB+7wwtKZRXhi1jN9nDTzAl5bxlrfBDze8deC4Rq0ORObUN2FpV87sUVymiI+Zaw45DihdLUlrZin4OSkmxKcSpx7M5mdGZxyD/VXhdtbx5K2//2xSDkNz6vuvT7Ytp0+Ug3Fyv9Xi2LuAXzh239u0ssCo/oXDtQjgnnh7uKC7uhY5gGWRPkqJ+4EoFzRyWhzCuxBhHbXYiNLVrHkV7WWuzw5g7w4gxTM8i5rD3HnOK1JAZGcyyCN/1q4EpghLUQ2AbwIHLfNyqYqqlVconn7exVq+12Ew== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=vT4N9d4wzzOHbV05DyZ1PX4f8CIq9+vkzCe4EBk6EfI=; b=qMymHalNDbxrrZUgr004NkierkviqqDaQDsJsPAyTRqbaFv1e30HeR9MdpdkwEDPol9YsklKxCVf2DLNPbp/xIXOMztS3OlELW0AwcmN87CwbxHDrq/w9b6PLtsBGKFsJxrGgLV8LENbEMdHhvxmlXYHtGE3iFM2IYIv0r8QX8NfI2kdhUuGHg2WejbigCSycmH6bNw6e8zpnHruVlxut9jdQl5+28VwkI2seOOyGItSHGYJiuBV9TdW+XepHMi71GJBjHNx9kxVqHPJl4YB+Fg3fXRsCBUl/Z1plRwF2qDbW7HCGg6fRjP1sL09C5kbTmpOS5tuNXgWpaOubpEShg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none; dkim=none; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=vT4N9d4wzzOHbV05DyZ1PX4f8CIq9+vkzCe4EBk6EfI=; b=sEK62qE57FJujFgR11F+DSuaMUvVhxIHCsrOkmBcfo1fF8padRqOzbNg+lTY3Z23uvrkCA6JD9bMJKDO7BtG7RtNbdTKgBPWaNJ76ul6ItYCv6zKY2MYR929geUvk8dPG+XLRsT+2LVDsB9gViqfbDh7eQy63pNmRAdDKA2Tm1qhcSCTpEAw8Vyia8hp7ScdKQY2loROt8F1JXYfhp+rnNg//ScJm1RTmj3Fp00K2pIOQ+j0vfTW1nRakYxxRRV4vGeYA6MQEVf/6A1yVU2d0Nf/3eXR4nBeoAtVsWZaL2mZt7SFtwPYCxCQcCW5KYQXv0DZxai3eoPbwbFGXNzV6w== Received: from DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM (2603:10a6:10:3e3::7) by PR3PR10MB4125.EURPRD10.PROD.OUTLOOK.COM (2603:10a6:102:a8::12) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7897.26; Tue, 27 Aug 2024 20:38:03 +0000 Received: from DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM ([fe80::b49d:da88:f1cc:953b]) by DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM ([fe80::b49d:da88:f1cc:953b%4]) with mapi id 15.20.7897.021; Tue, 27 Aug 2024 20:38:03 +0000 From: Stanislav Kozlovski To: "pgsql-general@lists.postgresql.org" Subject: tsvector limitations - why and how Thread-Topic: tsvector limitations - why and how Thread-Index: AQHa+L5cSGvGB0Cc2kCtr2rkZIQ6sA== Date: Tue, 27 Aug 2024 20:38:03 +0000 Message-ID: Accept-Language: en-AU, bg-BG, en-US Content-Language: en-AU X-MS-Has-Attach: X-MS-TNEF-Correlator: msip_labels: x-ms-exchange-messagesentrepresentingtype: 1 x-tmn: [3Nj6oaQGJ2oiLPTpShjyJW0Lv8C3YJfK75uJlnhji/KQwJEXqGwMJuLQl2nTNV4pKBo37gGMY1w=] x-ms-publictraffictype: Email x-ms-traffictypediagnostic: DU0PR10MB6060:EE_|PR3PR10MB4125:EE_ x-ms-office365-filtering-correlation-id: 55337b4e-99f9-489f-0865-08dcc6d825e6 x-microsoft-antispam: BCL:0;ARA:14566002|15030799003|12050799009|461199028|8060799006|15080799006|19110799003|9400799024|4302099013|440099028|3412199025|102099032|1602099012; x-microsoft-antispam-message-info: X0V8r2H757WR26D6NNZr8Ey5bGiSkIYzVI2Qb/8lD+ONrfrXCjH+Un4xGOEug/+CMiBDIU9gbUUrf8PFeO52kd+GwyyAhk510QVHnfisUQnYpjhFZlMMsMFGte3v3Ar44ml4OQl49hOYZ8ON2CMt3tQU6grbmENyVkKP02JO25t0/Mm5kCQw51jQVUFyqvdKuOEX11RxfFuECvTB1imK+xDZW7NZSv904SPgDweYpMtBPAGReSiQzTXlN2nxK9MxIpqAUJgncmTxyQbxx1QCaIRjEzN9brH+yUikkhWNBMMl4MV7xvZ10GjYHl/wYecboCGdkZZ3lfFT/a/tF3ANHx47DDHUMiIWWWc07FcPgwrzUsKU0i4dx7IN6g/LnoORwNGthRyllP9SVqgkWEU+3e1QGG85P5f4/E2XhFAkMXzYHk9m9URzAWO6I8XCEF2AMgrIRRp/K6aU85eRTAR0dJIGpulw+mzo/QVzkks9QBoJulKafhI6qy41Rdoz7Kp/8nHubLaD0vqPvsE1hdVYx2uyVk4G5IDgQSLkf3zfHrUDM4KIouIZ2T90wL/PcdO2EYa2PnCZ36wVGGpvPn/uycz/wo6pAgfcpsbGpxs7ZKa3168PFD9gGgSrfR4MADwp70rkn1RTl75LbfV8dQkTVgEH/dZia/24ozzkm7eKg42B5FTPSSbbQOuwemDWOVc9/cH7b6KHBEME9naK7KAJ949xYCpZE6FQD/h5tUQ5ESiPUpRy5h2yhpZMxRJpW0+ySekaE5wW575eSPPv8Nl5W0jVhF5nQz09W8Wh7aLcrwfmY8Y4ckptGWQjhcm1+FHSZSinTA83YES4LwG6SFqKjWgVnmeDcRrC0Rkv8DkqioKqfhmlvgjUCTrMDv/hY8fB x-ms-exchange-antispam-messagedata-chunkcount: 1 x-ms-exchange-antispam-messagedata-0: =?iso-8859-1?Q?sgmo10GTZXCH495uehV4Vy4Qfk81nq4SzxqXF3gj8u70rLsMBlesUccFlg?= =?iso-8859-1?Q?b6Gbtp6xQ9JIeA3LHqgHHTxNnflgiimaZKdfvj5UDLvlRLlYXPHcQx/WaX?= =?iso-8859-1?Q?GjWiPrrihkO2k1HVmf+xzYLtYiJRVKB/Nc+kBtnEbydOHb80hv9fX6R9wA?= =?iso-8859-1?Q?YngY0QciK8w6Di9xcavuCsoytsCqKX9nqqliwVDqNI6VFUNCvaaeNEtq55?= =?iso-8859-1?Q?wTxdNV4zZeefKzyqZo+6DlvYQPN0xE+MZ1QcpSWiByYRbEwua3HU6ZIb+G?= =?iso-8859-1?Q?m+J7vE3Gs4v4WlmiIK2nrNAjE5DpNTyiPubeIUduRuJ9mj9RNbtgF7ZTRN?= =?iso-8859-1?Q?Beqy6AVAlACNtlH4lmphEKIEiEjtFiXlM3WszXfH9Rm+1J/tY/NfglMAI2?= =?iso-8859-1?Q?vW7gOa1V+80O7ulnDMEbDK+9T2cYrE4iFxkt8wLfR+3to4cwxfBuEpv2EE?= =?iso-8859-1?Q?2JJqhPQaQ23bzWlPWm809IvdwmXXnUd+ZLUpBYDV1mDeoiqSyoiu8IiLzM?= =?iso-8859-1?Q?C8mQXPOmB8S2FSFmvgA7waEXG5g72o+hP0BTz9C7F7IcQQdvts4Cg707Fl?= =?iso-8859-1?Q?DWqtF8LmkXv1ZnWSaAbpZ3KS1lu/DHD/7muTya5fzZELDatgqF9Eg+J5ON?= =?iso-8859-1?Q?8oqWzCDRNb5WnbLUukVL/8xDQj068oxGde+fjq80Psq+ZPSCc/h77sylJe?= =?iso-8859-1?Q?i3zAqIT2CZ9LHgUqTy1KwBhmSjHDQTJsC9iOtz4TSfXO/NYRXi6KnMEqEz?= =?iso-8859-1?Q?2ooByVBl5wbytGI+N6T25VpeGv4Tr1Ny3pVWkhfTprzTZPqpZdrHdAjHOp?= =?iso-8859-1?Q?HzTZevBlL+dTRiPu7Hth3yt3Rfoilfwfw9CDI6D7fLKmDfmevoc5FdGEAq?= =?iso-8859-1?Q?HWM/5u3dIfWxeOGjtbnNbWPlgv/c6GNeYM2ktSLrYaRyE6aBHTe0PcoufV?= =?iso-8859-1?Q?SZwdmtAsK81GeK2xrv5uhffAmH8tuW6Kn/U/GjK1pVVH+Ew6HjoKRdNgHe?= =?iso-8859-1?Q?sRP9r6UBgjjjinpdV1kkg37P80/UkTKxydJ/E/9NIUEsNWA+MNRXxm/oHd?= =?iso-8859-1?Q?xbHuhRzzSNV4SH+vrN8DtyIGvCIyWMVLQOo6JJdu8gdt3aJLgfco/j9Y3y?= =?iso-8859-1?Q?IwhzSjTYnLQyh+fz0B2cFOpCu3yortvcjoMlMkPlBKgnoXE/wAqxd0njPO?= =?iso-8859-1?Q?21nH1KXel0q2MAWQoVDQCvOF5wOjt8wW0aY7IRNwqzT2n1xgvofcUnT3sc?= =?iso-8859-1?Q?5uM+H+/aNvd/5rPxF7VVeGgMFBN1uDAM/naQr2DlSmSUvSkZHJ9G+czP04?= =?iso-8859-1?Q?yyMdELZEQe3vx6OSDvIlnUK1RiSn8GhIrV+oKfcL0YcEPrmZvbwkc26Ei5?= =?iso-8859-1?Q?Iacy4NQk0s?= Content-Type: multipart/alternative; boundary="_000_DU0PR10MB60604534BEBB91C67FD743A78A942DU0PR10MB6060EURP_" MIME-Version: 1.0 X-OriginatorOrg: outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-AuthSource: DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-Network-Message-Id: 55337b4e-99f9-489f-0865-08dcc6d825e6 X-MS-Exchange-CrossTenant-originalarrivaltime: 27 Aug 2024 20:38:03.0796 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-Transport-CrossTenantHeadersStamped: PR3PR10MB4125 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --_000_DU0PR10MB60604534BEBB91C67FD743A78A942DU0PR10MB6060EURP_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Hey! I just spent some time extensively documenting a draft e-mail I planned to = send over here to report what I thought was a bug. I was playing around with tsvector and full text search for a pet project o= f mine, ingesting books' content. I was aware of the limitations of FTS and tried to ensure I didn't hit any - but what = I missed was that the maximum allowed lexeme position was 16383 and everyth= ing above silently gets set to 16383. I was searching for a phrase (two wor= ds) at the end of the book and couldn't find it. After debugging I realized= that my phrase's lexemes were being set to 16383, which was inaccurate. I acknowledge that this limitation still sets the single lexeme and makes i= t searchable. I commend the authors for its elegant approach of still provi= ding functionality beyond the limitations that save space. The problem I had is that it breaks FOLLOWED BY queries, essentially stoppi= ng you from being able to match on phrases (more than one word) on large te= xt. I have a few questions here: 1. Why is this still the case? The earliest I could see this limitation in place was from PG 8.3, released 16 years = ago (2008). Hardware, and in particular memory and storage, have gone a lon= g way since. Do these limits still make sense? I saw a thread from 6 years ago (2018) which asked a simi= lar question, but it was focused on hitting the 1MB limit of the tsvector. There, a member suggested that you'd probably need a whole new type (e.g bi= gtsvector) to make this work. Somebody else shared a project called tsvecto= r2 which was a beefed-up version. There was also a patch that attempted to remove the 1MB limit, but = the contributor had it stall and never got it merged. I'm not aware if that= would have fixed the positional limitation. 1. Are there any other drawbacks that I'm missing? My understanding of the 16383 position limit is the following: * Lexemes present beyond the 16383 character mark will still match search que= ries, because they'll be represented as 'lexeme: 16383' * Phrase searches for lexemes beyond the 16383 character mark won't ever find= anything Is there anything else I'm missing? If no - I'd be happy to contribute this piece of text to the docs - particu= larly in the limitations part, as I believe this would help people reading = it. 1. What is the preferred/conventional way to store large text? If I want to store a whole books' content - like PostgreSQL: Up and Running= (2012) - I'd need to store it over 30 rows. (it's 300 pages long, 300-page= books average about 82500 words, English words average about 6.5-4 charact= ers, meaning a tsvector will hold the positions of no more than [2520-3277]= words). 30 rows sounds fair, but there are two complications: * Splitting the chapters in a logical way is tricky * FOLLOWED BY queries between the rows will not work. This is fine when it's = separated logically, e.g like a chapter, but not if it's separated within a= chapter. The example book has 10 chapters, so it'd average around 3 rows a= chapter. Are there any unofficial larger tsvector types? Am I doing anything wrong? Thanks for reading this. I'd appreciate any response. Best, Stanislav --_000_DU0PR10MB60604534BEBB91C67FD743A78A942DU0PR10MB6060EURP_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
Hey!

I just spent some time extensively documenting a draft e-mail I planned to = send over here to report what I thought was a bug.

I was playing around with tsvector and full text search for a pet project o= f mine, ingesting books' content.

I was aware of the limitations of FTS and tried to ensure I didn't hit any - but what= I missed was that the maximum allowed lexeme position was 16383 and everyt= hing above silently gets set to 16383. I was searching for a phrase (two wo= rds) at the end of the book and couldn't find it. After debugging I realized that my phrase's lexemes were being se= t to 16383, which was inaccurate.

I acknowledge that this limitation still sets the single lexeme and makes i= t searchable. I commend the authors for its elegant approach of still provi= ding functionality beyond the limitations that save space.

The problem I had is that it breaks FOLLOWED BY queries, essentially stoppi= ng you from being able to match on phrases (more than one word) on large te= xt.

I have a few questions here:

  1. Why is this still the case?

The earliest I could see this limitation in place was from PG 8.3, released 16 years ago (2008). Hardware, and in particular memor= y and storage, have gone a long way since.
Do these limits still make sense?

I saw a thread from 6 years ago (2018) which asked a similar question, b= ut it was focused on hitting the 1MB limit of the tsvector.
There, a member suggested that you'd probably need a whole new type (e.g bi= gtsvector) to make this work. Somebody else shared a project called tsvector2 which was a beefed-up version.
There was also a patch that attempted to remove the 1MB limit, but the contributo= r had it stall and never got it merged. I'm not aware if that would have fi= xed the positional limitation.

  1. Are there any other drawbacks that I'm missing?

My understanding of the 16383 position limit is the following:
  • Lexemes present beyond the 16383 character ma= rk will still match search queries, because they'll be represented as 'lexe= me: 16383'
  • Phrase searches for lexemes beyond the 16383 = character mark won't ever find anything
Is there anything else I'm missing?

If no - I'd be happy to contribute this piece of text to the docs - particu= larly in the limitations part, as I believe this would help people reading = it.

  1. What is the preferred/conventional way to sto= re large text?

If I want to store a whole books' content - like PostgreSQL: Up and Running= (2012) - I'd need to store it over 30 rows. (it's 300 pages long, 300-page= books average about 82500 words, English words average about 6.5-4 charact= ers, meaning a tsvector will hold the positions of no more than [2520-3277] words).

30 rows sounds fair, but there are two complications:
  • Splitting the chapters in a logical way is tr= icky
  • FOLLOWED BY queries between the rows will not= work. This is fine when it's separated logically, e.g like a chapter, but = not if it's separated within a chapter. The example book has 10 chapters, s= o it'd average around 3 rows a chapter.

Are there any unofficial larger tsvector types? Am I doing anything wrong?<= /div>

Thanks for reading this. I'd appreciate any response.

Best,
Stanislav
--_000_DU0PR10MB60604534BEBB91C67FD743A78A942DU0PR10MB6060EURP_--