public inbox for [email protected]
help / color / mirror / Atom feedRe: Use correct collation in pg_trgm
7+ messages / 3 participants
[nested] [flat]
* Re: Use correct collation in pg_trgm
@ 2026-01-26 13:33 ` Alexander Korotkov <[email protected]>
2026-01-30 08:42 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
1 sibling, 1 reply; 7+ messages in thread
From: Alexander Korotkov @ 2026-01-26 13:33 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: David Geier <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Kirill Reshke <[email protected]>
Hi!
Thank you for working on this.
On Mon, Jan 26, 2026 at 3:24 PM Zsolt Parragi <[email protected]> wrote:
>
> > While reading through [1] I realized that the word boundary detection
> > also uses the wrong collation. Patch 0002 fixes that.
>
> Good catch, I considered checking that when I reviewed the patch, but
> I thought that it would be an issue with CJK languages, and I was
> completely wrong about that.
>
> Updated patch looks good, I only noticed two minor things:
>
> * the new test should have a newline at the end of the file
> * and probably a conditional skip based on locale availability, the
> citext_utf8.sql test case does something similar
I wonder about existing indexes, which already use default collation
not column collation. Should we add a release note saying they might
be obsolete? Alternatively, given we now have opclass options, we may
add a new opclass option defining whether to use column collation
(must be false for existing indexes).
------
Regards,
Alexander Korotkov
Supabase
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
2026-01-26 13:33 ` Re: Use correct collation in pg_trgm Alexander Korotkov <[email protected]>
@ 2026-01-30 08:42 ` David Geier <[email protected]>
2026-02-02 20:33 ` Re: Use correct collation in pg_trgm Zsolt Parragi <[email protected]>
0 siblings, 1 reply; 7+ messages in thread
From: David Geier @ 2026-01-30 08:42 UTC (permalink / raw)
To: Alexander Korotkov <[email protected]>; Zsolt Parragi <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Kirill Reshke <[email protected]>
Attached is v3 of the patch with the following changes:
On 26.01.2026 14:33, Alexander Korotkov wrote:
>> Updated patch looks good, I only noticed two minor things:
>>
>> * the new test should have a newline at the end of the file
Done.
>> * and probably a conditional skip based on locale availability, the
>> citext_utf8.sql test case does something similar
Done, the same way collate.icu.utf8.sql does it. This test also uses the
Turkish collation tr-x-icu.
> I wonder about existing indexes, which already use default collation
> not column collation. Should we add a release note saying they might
> be obsolete? Alternatively, given we now have opclass options, we may
> add a new opclass option defining whether to use column collation
> (must be false for existing indexes).
I think adding a release note is fine. We discussed this in [1] already
and there's precedence for doing it this way. See [2].
Looking at [3], it seems like we don't include release notes in
bug fix commits but rather collect them retroactively before cutting the
release. The file doc/src/sgml/release-19.sgml is also still completely
empty on master.
--
David Geier
[1]
https://www.postgresql.org/message-id/CAEze2WiUL9idZBbuUN%2BMuWqr6DcPr_-C91E9MTx%3DH62Xx5fHaQ%40mail...
[2] https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION
[3]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0
Attachments:
[text/x-patch] v3-0002-Use-correct-collation-for-finding-word-boundaries.patch (8.9K, 2-v3-0002-Use-correct-collation-for-finding-word-boundaries.patch)
download | inline diff:
From 9300b1f3124551439405014a659329ff63f0afa4 Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Fri, 23 Jan 2026 15:39:06 +0100
Subject: [PATCH v3 2/2] Use correct collation for finding word boundaries
pg_trgm finds all words in the input string and creates trigrams for
them. Word characters are alpha-numeric characters. What qualifies as
alpha-numeric character depends on the collation. Previously, pg_trgm
always used the default collation. Now the specified collation is used
instead.
---
.../pg_trgm/expected/pg_trgm_collation.out | 13 ++++++++++++
contrib/pg_trgm/sql/pg_trgm_collation.sql | 5 +++++
contrib/pg_trgm/trgm.h | 6 +++---
contrib/pg_trgm/trgm_op.c | 21 ++++++++++---------
contrib/pg_trgm/trgm_regexp.c | 2 +-
src/backend/tsearch/ts_locale.c | 4 ++--
src/include/tsearch/ts_locale.h | 9 +++++++-
7 files changed, 43 insertions(+), 17 deletions(-)
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
index 472ce867665..0cc53edd821 100644
--- a/contrib/pg_trgm/expected/pg_trgm_collation.out
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -41,3 +41,16 @@ SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
0.54545456
(1 row)
+-- Test that word boundary identification uses specified collation
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------------------------
+ {0x8fc0a2,0x93dfbf,0x1bf43c," h"," he",0x22d44f,0x4398ff,cod,"de ",dic,ell,est,hel,ico,ldi,llo,ode,orl,0x71b8f5,rld,tes,wor}
+(1 row)
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------
+ {" c"," h"," t"," co"," he"," te",cod,"de ",ell,est,hel,iwo,"ld ",llo,"lo ",ode,orl,rld,sti,tes,tiw,wor}
+(1 row)
+
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
index 128c97ceb54..f50d67185c7 100644
--- a/contrib/pg_trgm/sql/pg_trgm_collation.sql
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -21,3 +21,8 @@ SELECT show_trgm('ISTANBUL' COLLATE "C");
SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+
+-- Test that word boundary identification uses specified collation
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
\ No newline at end of file
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 147eefaa3c6..6f0e0a20789 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -47,9 +47,9 @@ typedef char trgm[3];
} while(0)
extern int (*CMPTRGM) (const void *a, const void *b);
-#define ISWORDCHR(c) (t_isalnum(c))
-#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
-#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
+#define ISWORDCHR(c, collation) (t_isalnum_collation(c, collation))
+#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
+#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
#define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
#define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index d4c75caeff9..399e6a26f34 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -220,11 +220,11 @@ comp_trgm(const void *a, const void *b)
* endword points to the character after word
*/
static char *
-find_word(char *str, int lenstr, char **endword, int *charlen)
+find_word(char *str, int lenstr, char **endword, int *charlen, Oid collation)
{
char *beginword = str;
- while (beginword - str < lenstr && !ISWORDCHR(beginword))
+ while (beginword - str < lenstr && !ISWORDCHR(beginword, collation))
beginword += pg_mblen(beginword);
if (beginword - str >= lenstr)
@@ -232,7 +232,7 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
*endword = beginword;
*charlen = 0;
- while (*endword - str < lenstr && ISWORDCHR(*endword))
+ while (*endword - str < lenstr && ISWORDCHR(*endword, collation))
{
*endword += pg_mblen(*endword);
(*charlen)++;
@@ -349,7 +349,7 @@ generate_trgm_only(trgm *trg, char *str, int slen, Oid collation, TrgmBound *bou
}
eword = str;
- while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
+ while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen, collation)) != NULL)
{
#ifdef IGNORECASE
bword = str_tolower(bword, eword - bword, collation);
@@ -771,7 +771,8 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
*/
static const char *
get_wildcard_part(const char *str, int lenstr,
- char *buf, int *bytelen, int *charlen)
+ char *buf, int *bytelen, int *charlen,
+ Oid collation)
{
const char *beginword = str;
const char *endword;
@@ -791,7 +792,7 @@ get_wildcard_part(const char *str, int lenstr,
{
if (in_escape)
{
- if (ISWORDCHR(beginword))
+ if (ISWORDCHR(beginword, collation))
break;
in_escape = false;
in_leading_wildcard_meta = false;
@@ -802,7 +803,7 @@ get_wildcard_part(const char *str, int lenstr,
in_escape = true;
else if (ISWILDCARDCHAR(beginword))
in_leading_wildcard_meta = true;
- else if (ISWORDCHR(beginword))
+ else if (ISWORDCHR(beginword, collation))
break;
else
in_leading_wildcard_meta = false;
@@ -845,7 +846,7 @@ get_wildcard_part(const char *str, int lenstr,
clen = pg_mblen(endword);
if (in_escape)
{
- if (ISWORDCHR(endword))
+ if (ISWORDCHR(endword, collation))
{
memcpy(s, endword, clen);
(*charlen)++;
@@ -873,7 +874,7 @@ get_wildcard_part(const char *str, int lenstr,
in_trailing_wildcard_meta = true;
break;
}
- else if (ISWORDCHR(endword))
+ else if (ISWORDCHR(endword, collation))
{
memcpy(s, endword, clen);
(*charlen)++;
@@ -945,7 +946,7 @@ generate_wildcard_trgm(const char *str, int slen, Oid collation)
*/
eword = str;
while ((eword = get_wildcard_part(eword, slen - (eword - str),
- buf, &bytelen, &charlen)) != NULL)
+ buf, &bytelen, &charlen, collation)) != NULL)
{
#ifdef IGNORECASE
buf2 = str_tolower(buf, bytelen, collation);
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index cdd04fa01ad..a89dbe1880b 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -810,7 +810,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
if (!convertPgWchar(chars[j], &c, collation))
continue; /* ok to ignore it altogether */
- if (ISWORDCHR(c.bytes))
+ if (ISWORDCHR(c.bytes, collation))
colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
else
colorInfo->containsNonWord = true;
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 1e98f321957..3caad2e5c2a 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -37,7 +37,7 @@ t_isalpha(const char *ptr)
}
int
-t_isalnum(const char *ptr)
+t_isalnum_collation(const char *ptr, Oid collation)
{
pg_wchar wstr[WC_BUF_LEN];
int wlen pg_attribute_unused();
@@ -46,7 +46,7 @@ t_isalnum(const char *ptr)
Assert(wlen <= 1);
/* pass single character, or NUL if empty */
- return pg_iswalnum(wstr[0], pg_database_locale());
+ return pg_iswalnum(wstr[0], pg_newlocale_from_collation(collation));
}
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index cea417a91b5..b1a2a45ed9d 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -18,6 +18,7 @@
#include "lib/stringinfo.h"
#include "mb/pg_wchar.h"
+#include "catalog/pg_collation.h"
#include "utils/pg_locale.h"
/* working state for tsearch_readline (should be a local var in caller) */
@@ -40,7 +41,13 @@ typedef struct
#define COPYCHAR(d,s) memcpy(d, s, pg_mblen(s))
extern int t_isalpha(const char *ptr);
-extern int t_isalnum(const char *ptr);
+extern int t_isalnum_collation(const char *ptr, Oid collation);
+
+static inline int
+t_isalnum(const char *ptr)
+{
+ return t_isalnum_collation(ptr, DEFAULT_COLLATION_OID);
+}
extern bool tsearch_readline_begin(tsearch_readline_state *stp,
const char *filename);
--
2.51.0
[text/x-patch] v3-0001-Use-correct-collation-for-lowercasing.patch (18.8K, 3-v3-0001-Use-correct-collation-for-lowercasing.patch)
download | inline diff:
From e234d513abc2214620e04de43d5b2d498de65f16 Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Wed, 21 Jan 2026 14:54:28 +0100
Subject: [PATCH v3 1/2] Use correct collation for lowercasing
pg_trgm converts the input words to lowercase before extracting the
trigrams. The lowercase conversion depends on the collation. Previously,
pg_trgm always used the default collation. Now, the specified collation
is used instead.
---
contrib/pg_trgm/Makefile | 2 +-
.../pg_trgm/expected/pg_trgm_collation.out | 43 +++++++++++++
contrib/pg_trgm/meson.build | 1 +
contrib/pg_trgm/sql/pg_trgm_collation.sql | 23 +++++++
contrib/pg_trgm/trgm.h | 4 +-
contrib/pg_trgm/trgm_gin.c | 7 ++-
contrib/pg_trgm/trgm_gist.c | 10 ++--
contrib/pg_trgm/trgm_op.c | 60 ++++++++++---------
contrib/pg_trgm/trgm_regexp.c | 20 +++----
9 files changed, 122 insertions(+), 48 deletions(-)
create mode 100644 contrib/pg_trgm/expected/pg_trgm_collation.out
create mode 100644 contrib/pg_trgm/sql/pg_trgm_collation.sql
diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index 1fbdc9ec1ef..1450f2aaea7 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -14,7 +14,7 @@ DATA = pg_trgm--1.5--1.6.sql pg_trgm--1.4--1.5.sql pg_trgm--1.3--1.4.sql \
pg_trgm--1.0--1.1.sql
PGFILEDESC = "pg_trgm - trigram matching"
-REGRESS = pg_trgm pg_word_trgm pg_strict_word_trgm
+REGRESS = pg_trgm pg_word_trgm pg_strict_word_trgm pg_trgm_collation
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
new file mode 100644
index 00000000000..472ce867665
--- /dev/null
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -0,0 +1,43 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- Test that lowercase conversion of trigrams uses specified collation
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+ show_trgm
+---------------------------------------------
+ {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+ similarity
+------------
+ 1
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+ similarity
+------------
+ 0.54545456
+(1 row)
+
diff --git a/contrib/pg_trgm/meson.build b/contrib/pg_trgm/meson.build
index 3cc299d5eaa..95d2806dda6 100644
--- a/contrib/pg_trgm/meson.build
+++ b/contrib/pg_trgm/meson.build
@@ -41,6 +41,7 @@ tests += {
'pg_trgm',
'pg_word_trgm',
'pg_strict_word_trgm',
+ 'pg_trgm_collation',
],
},
}
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
new file mode 100644
index 00000000000..128c97ceb54
--- /dev/null
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -0,0 +1,23 @@
+/*
+ * This test is for ICU collations.
+ */
+
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- Test that lowercase conversion of trigrams uses specified collation
+
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index ca017585369..147eefaa3c6 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -119,8 +119,8 @@ extern double strict_word_similarity_threshold;
extern double index_strategy_get_limit(StrategyNumber strategy);
extern uint32 trgm2int(trgm *ptr);
extern void compact_trigram(trgm *tptr, char *str, int bytelen);
-extern TRGM *generate_trgm(char *str, int slen);
-extern TRGM *generate_wildcard_trgm(const char *str, int slen);
+extern TRGM *generate_trgm(char *str, int slen, Oid collation);
+extern TRGM *generate_wildcard_trgm(const char *str, int slen, Oid collation);
extern float4 cnt_sml(TRGM *trg1, TRGM *trg2, bool inexact);
extern bool trgm_contained_by(TRGM *trg1, TRGM *trg2);
extern bool *trgm_presence_map(TRGM *query, TRGM *key);
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
index 014bb3c848c..125e71ca23a 100644
--- a/contrib/pg_trgm/trgm_gin.c
+++ b/contrib/pg_trgm/trgm_gin.c
@@ -42,7 +42,7 @@ gin_extract_value_trgm(PG_FUNCTION_ARGS)
*nentries = 0;
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
trglen = ARRNELEM(trg);
if (trglen > 0)
@@ -93,7 +93,7 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
case WordSimilarityStrategyNumber:
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -107,7 +107,8 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
* potentially-matching string must include.
*/
trg = generate_wildcard_trgm(VARDATA_ANY(val),
- VARSIZE_ANY_EXHDR(val));
+ VARSIZE_ANY_EXHDR(val),
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 2f0d61985a5..22cc2a6f7b1 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -123,7 +123,7 @@ gtrgm_compress(PG_FUNCTION_ARGS)
TRGM *res;
text *val = DatumGetTextPP(entry->key);
- res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
retval = palloc_object(GISTENTRY);
gistentryinit(*retval, PointerGetDatum(res),
entry->rel, entry->page,
@@ -242,7 +242,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
qtrg = generate_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -251,7 +252,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
/* FALL THRU */
case LikeStrategyNumber:
qtrg = generate_wildcard_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
@@ -475,7 +477,7 @@ gtrgm_distance(PG_FUNCTION_ARGS)
{
char *newcache;
- qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ);
+ qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ, PG_GET_COLLATION());
newcache = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
MAXALIGN(querysize) +
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index 81182a15e07..d4c75caeff9 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -324,7 +324,7 @@ make_trigrams(trgm *tptr, char *str, int bytelen, int charlen)
* Returns length of the generated array.
*/
static int
-generate_trgm_only(trgm *trg, char *str, int slen, TrgmBound *bounds)
+generate_trgm_only(trgm *trg, char *str, int slen, Oid collation, TrgmBound *bounds)
{
trgm *tptr;
char *buf;
@@ -352,7 +352,7 @@ generate_trgm_only(trgm *trg, char *str, int slen, TrgmBound *bounds)
while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
{
#ifdef IGNORECASE
- bword = str_tolower(bword, eword - bword, DEFAULT_COLLATION_OID);
+ bword = str_tolower(bword, eword - bword, collation);
bytelen = strlen(bword);
#else
bytelen = eword - bword;
@@ -405,7 +405,7 @@ protect_out_of_mem(int slen)
* Returns the sorted array of unique trigrams.
*/
TRGM *
-generate_trgm(char *str, int slen)
+generate_trgm(char *str, int slen, Oid collation)
{
TRGM *trg;
int len;
@@ -415,7 +415,7 @@ generate_trgm(char *str, int slen)
trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) * 3);
trg->flag = ARRKEY;
- len = generate_trgm_only(GETARR(trg), str, slen, NULL);
+ len = generate_trgm_only(GETARR(trg), str, slen, collation, NULL);
SET_VARSIZE(trg, CALCGTSIZE(ARRKEY, len));
if (len == 0)
@@ -671,7 +671,7 @@ iterate_word_similarity(int *trg2indexes,
*/
static float4
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
- uint8 flags)
+ uint8 flags, Oid collation)
{
bool *found;
pos_trgm *ptrg;
@@ -697,8 +697,8 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
else
bounds = NULL;
- len1 = generate_trgm_only(trg1, str1, slen1, NULL);
- len2 = generate_trgm_only(trg2, str2, slen2, bounds);
+ len1 = generate_trgm_only(trg1, str1, slen1, collation, NULL);
+ len2 = generate_trgm_only(trg2, str2, slen2, collation, bounds);
ptrg = make_positional_trgm(trg1, len1, trg2, len2);
len = len1 + len2;
@@ -915,7 +915,7 @@ get_wildcard_part(const char *str, int lenstr,
* " a", "bcd" would be extracted.
*/
TRGM *
-generate_wildcard_trgm(const char *str, int slen)
+generate_wildcard_trgm(const char *str, int slen, Oid collation)
{
TRGM *trg;
char *buf,
@@ -948,7 +948,7 @@ generate_wildcard_trgm(const char *str, int slen)
buf, &bytelen, &charlen)) != NULL)
{
#ifdef IGNORECASE
- buf2 = str_tolower(buf, bytelen, DEFAULT_COLLATION_OID);
+ buf2 = str_tolower(buf, bytelen, collation);
bytelen = strlen(buf2);
#else
buf2 = buf;
@@ -1007,7 +1007,7 @@ show_trgm(PG_FUNCTION_ARGS)
trgm *ptr;
int i;
- trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));
+ trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in), PG_GET_COLLATION());
d = palloc_array(Datum, 1 + ARRNELEM(trg));
for (i = 0, ptr = GETARR(trg); i < ARRNELEM(trg); i++, ptr++)
@@ -1174,8 +1174,8 @@ similarity(PG_FUNCTION_ARGS)
*trg2;
float4 res;
- trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1));
- trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2));
+ trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1), PG_GET_COLLATION());
+ trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2), PG_GET_COLLATION());
res = cnt_sml(trg1, trg2, false);
@@ -1196,7 +1196,7 @@ word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1212,7 +1212,7 @@ strict_word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1222,9 +1222,10 @@ strict_word_similarity(PG_FUNCTION_ARGS)
Datum
similarity_dist(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_FLOAT4(1.0 - res);
}
@@ -1232,9 +1233,10 @@ similarity_dist(PG_FUNCTION_ARGS)
Datum
similarity_op(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_BOOL(res >= similarity_threshold);
}
@@ -1248,7 +1250,7 @@ word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1264,7 +1266,7 @@ word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1280,7 +1282,7 @@ word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1296,7 +1298,7 @@ word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1312,7 +1314,8 @@ strict_word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1328,7 +1331,8 @@ strict_word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1344,7 +1348,7 @@ strict_word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1360,7 +1364,7 @@ strict_word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1d1b5fe304d..cdd04fa01ad 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -479,11 +479,11 @@ typedef struct
/* prototypes for private functions */
static TRGM *createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext);
+ MemoryContext rcontext, Oid collation);
static void RE_compile(regex_t *regex, text *text_re,
int cflags, Oid collation);
-static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA);
-static bool convertPgWchar(pg_wchar c, trgm_mb_char *result);
+static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation);
+static bool convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation);
static void transformGraph(TrgmNFA *trgmNFA);
static void processState(TrgmNFA *trgmNFA, TrgmState *state);
static void addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key);
@@ -551,7 +551,7 @@ createTrgmNFA(text *text_re, Oid collation,
REG_ADVANCED | REG_NOSUB, collation);
#endif
- trg = createTrgmNFAInternal(®ex, graph, rcontext);
+ trg = createTrgmNFAInternal(®ex, graph, rcontext, collation);
/* Clean up all the cruft we created (including regex) */
MemoryContextSwitchTo(oldcontext);
@@ -565,7 +565,7 @@ createTrgmNFA(text *text_re, Oid collation,
*/
static TRGM *
createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext)
+ MemoryContext rcontext, Oid collation)
{
TRGM *trg;
TrgmNFA trgmNFA;
@@ -573,7 +573,7 @@ createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
trgmNFA.regex = regex;
/* Collect color information from the regex */
- getColorInfo(regex, &trgmNFA);
+ getColorInfo(regex, &trgmNFA, collation);
#ifdef TRGM_REGEXP_DEBUG
printSourceNFA(regex, trgmNFA.colorInfo, trgmNFA.ncolors);
@@ -762,7 +762,7 @@ RE_compile(regex_t *regex, text *text_re, int cflags, Oid collation)
* Fill TrgmColorInfo structure for each color using regex export functions.
*/
static void
-getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
+getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
{
int colorsCount = pg_reg_getnumcolors(regex);
int i;
@@ -808,7 +808,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
{
trgm_mb_char c;
- if (!convertPgWchar(chars[j], &c))
+ if (!convertPgWchar(chars[j], &c, collation))
continue; /* ok to ignore it altogether */
if (ISWORDCHR(c.bytes))
colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
@@ -825,7 +825,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
* Returns false if the character should be ignored completely.
*/
static bool
-convertPgWchar(pg_wchar c, trgm_mb_char *result)
+convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation)
{
/* "s" has enough space for a multibyte character and a trailing NUL */
char s[MAX_MULTIBYTE_CHAR_LEN + 1];
@@ -857,7 +857,7 @@ convertPgWchar(pg_wchar c, trgm_mb_char *result)
*/
#ifdef IGNORECASE
{
- char *lowerCased = str_tolower(s, strlen(s), DEFAULT_COLLATION_OID);
+ char *lowerCased = str_tolower(s, strlen(s), collation);
if (strcmp(lowerCased, s) != 0)
{
--
2.51.0
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
2026-01-26 13:33 ` Re: Use correct collation in pg_trgm Alexander Korotkov <[email protected]>
2026-01-30 08:42 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
@ 2026-02-02 20:33 ` Zsolt Parragi <[email protected]>
2026-02-03 11:03 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
0 siblings, 1 reply; 7+ messages in thread
From: Zsolt Parragi @ 2026-02-02 20:33 UTC (permalink / raw)
To: David Geier <[email protected]>; +Cc: Alexander Korotkov <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Kirill Reshke <[email protected]>
>> * the new test should have a newline at the end of the file
>
> Done.
It is still missing.
Otherwise it looks good to me.
(I don't see a commitfest entry for it?)
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
2026-01-26 13:33 ` Re: Use correct collation in pg_trgm Alexander Korotkov <[email protected]>
2026-01-30 08:42 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
2026-02-02 20:33 ` Re: Use correct collation in pg_trgm Zsolt Parragi <[email protected]>
@ 2026-02-03 11:03 ` David Geier <[email protected]>
2026-03-02 13:47 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
0 siblings, 1 reply; 7+ messages in thread
From: David Geier @ 2026-02-03 11:03 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Alexander Korotkov <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Kirill Reshke <[email protected]>
On 02.02.2026 21:33, Zsolt Parragi wrote:
>>> * the new test should have a newline at the end of the file
>>
>> Done.
>
> It is still missing.
Sorry, my bad. Now it's there.
> Otherwise it looks good to me.
>
> (I don't see a commitfest entry for it?)
I thought Alexander might pick it up without CF entry. I created now
one, see [1]. Will you add yourself as reviewer?
--
David Geier
[1] https://commitfest.postgresql.org/patch/6454/
Attachments:
[text/x-patch] v4-0002-Use-correct-collation-for-finding-word-boundaries.patch (8.9K, 2-v4-0002-Use-correct-collation-for-finding-word-boundaries.patch)
download | inline diff:
From 85959ed1dbc3a8138db086d6d243f0b2d088138b Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Fri, 23 Jan 2026 15:39:06 +0100
Subject: [PATCH v4 2/2] Use correct collation for finding word boundaries
pg_trgm finds all words in the input string and creates trigrams for
them. Word characters are alpha-numeric characters. What qualifies as
alpha-numeric character depends on the collation. Previously, pg_trgm
always used the default collation. Now the specified collation is used
instead.
---
.../pg_trgm/expected/pg_trgm_collation.out | 13 ++++++++++++
contrib/pg_trgm/sql/pg_trgm_collation.sql | 5 +++++
contrib/pg_trgm/trgm.h | 6 +++---
contrib/pg_trgm/trgm_op.c | 21 ++++++++++---------
contrib/pg_trgm/trgm_regexp.c | 2 +-
src/backend/tsearch/ts_locale.c | 4 ++--
src/include/tsearch/ts_locale.h | 9 +++++++-
7 files changed, 43 insertions(+), 17 deletions(-)
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
index 472ce867665..0cc53edd821 100644
--- a/contrib/pg_trgm/expected/pg_trgm_collation.out
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -41,3 +41,16 @@ SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
0.54545456
(1 row)
+-- Test that word boundary identification uses specified collation
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------------------------
+ {0x8fc0a2,0x93dfbf,0x1bf43c," h"," he",0x22d44f,0x4398ff,cod,"de ",dic,ell,est,hel,ico,ldi,llo,ode,orl,0x71b8f5,rld,tes,wor}
+(1 row)
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------
+ {" c"," h"," t"," co"," he"," te",cod,"de ",ell,est,hel,iwo,"ld ",llo,"lo ",ode,orl,rld,sti,tes,tiw,wor}
+(1 row)
+
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
index afb3973a8b3..e1a5c7c5fa8 100644
--- a/contrib/pg_trgm/sql/pg_trgm_collation.sql
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -22,3 +22,8 @@ SELECT show_trgm('ISTANBUL' COLLATE "C");
SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+-- Test that word boundary identification uses specified collation
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
+
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 147eefaa3c6..6f0e0a20789 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -47,9 +47,9 @@ typedef char trgm[3];
} while(0)
extern int (*CMPTRGM) (const void *a, const void *b);
-#define ISWORDCHR(c) (t_isalnum(c))
-#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
-#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
+#define ISWORDCHR(c, collation) (t_isalnum_collation(c, collation))
+#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
+#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
#define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
#define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index d4c75caeff9..399e6a26f34 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -220,11 +220,11 @@ comp_trgm(const void *a, const void *b)
* endword points to the character after word
*/
static char *
-find_word(char *str, int lenstr, char **endword, int *charlen)
+find_word(char *str, int lenstr, char **endword, int *charlen, Oid collation)
{
char *beginword = str;
- while (beginword - str < lenstr && !ISWORDCHR(beginword))
+ while (beginword - str < lenstr && !ISWORDCHR(beginword, collation))
beginword += pg_mblen(beginword);
if (beginword - str >= lenstr)
@@ -232,7 +232,7 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
*endword = beginword;
*charlen = 0;
- while (*endword - str < lenstr && ISWORDCHR(*endword))
+ while (*endword - str < lenstr && ISWORDCHR(*endword, collation))
{
*endword += pg_mblen(*endword);
(*charlen)++;
@@ -349,7 +349,7 @@ generate_trgm_only(trgm *trg, char *str, int slen, Oid collation, TrgmBound *bou
}
eword = str;
- while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
+ while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen, collation)) != NULL)
{
#ifdef IGNORECASE
bword = str_tolower(bword, eword - bword, collation);
@@ -771,7 +771,8 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
*/
static const char *
get_wildcard_part(const char *str, int lenstr,
- char *buf, int *bytelen, int *charlen)
+ char *buf, int *bytelen, int *charlen,
+ Oid collation)
{
const char *beginword = str;
const char *endword;
@@ -791,7 +792,7 @@ get_wildcard_part(const char *str, int lenstr,
{
if (in_escape)
{
- if (ISWORDCHR(beginword))
+ if (ISWORDCHR(beginword, collation))
break;
in_escape = false;
in_leading_wildcard_meta = false;
@@ -802,7 +803,7 @@ get_wildcard_part(const char *str, int lenstr,
in_escape = true;
else if (ISWILDCARDCHAR(beginword))
in_leading_wildcard_meta = true;
- else if (ISWORDCHR(beginword))
+ else if (ISWORDCHR(beginword, collation))
break;
else
in_leading_wildcard_meta = false;
@@ -845,7 +846,7 @@ get_wildcard_part(const char *str, int lenstr,
clen = pg_mblen(endword);
if (in_escape)
{
- if (ISWORDCHR(endword))
+ if (ISWORDCHR(endword, collation))
{
memcpy(s, endword, clen);
(*charlen)++;
@@ -873,7 +874,7 @@ get_wildcard_part(const char *str, int lenstr,
in_trailing_wildcard_meta = true;
break;
}
- else if (ISWORDCHR(endword))
+ else if (ISWORDCHR(endword, collation))
{
memcpy(s, endword, clen);
(*charlen)++;
@@ -945,7 +946,7 @@ generate_wildcard_trgm(const char *str, int slen, Oid collation)
*/
eword = str;
while ((eword = get_wildcard_part(eword, slen - (eword - str),
- buf, &bytelen, &charlen)) != NULL)
+ buf, &bytelen, &charlen, collation)) != NULL)
{
#ifdef IGNORECASE
buf2 = str_tolower(buf, bytelen, collation);
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index cdd04fa01ad..a89dbe1880b 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -810,7 +810,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
if (!convertPgWchar(chars[j], &c, collation))
continue; /* ok to ignore it altogether */
- if (ISWORDCHR(c.bytes))
+ if (ISWORDCHR(c.bytes, collation))
colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
else
colorInfo->containsNonWord = true;
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 1e98f321957..3caad2e5c2a 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -37,7 +37,7 @@ t_isalpha(const char *ptr)
}
int
-t_isalnum(const char *ptr)
+t_isalnum_collation(const char *ptr, Oid collation)
{
pg_wchar wstr[WC_BUF_LEN];
int wlen pg_attribute_unused();
@@ -46,7 +46,7 @@ t_isalnum(const char *ptr)
Assert(wlen <= 1);
/* pass single character, or NUL if empty */
- return pg_iswalnum(wstr[0], pg_database_locale());
+ return pg_iswalnum(wstr[0], pg_newlocale_from_collation(collation));
}
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index cea417a91b5..b1a2a45ed9d 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -18,6 +18,7 @@
#include "lib/stringinfo.h"
#include "mb/pg_wchar.h"
+#include "catalog/pg_collation.h"
#include "utils/pg_locale.h"
/* working state for tsearch_readline (should be a local var in caller) */
@@ -40,7 +41,13 @@ typedef struct
#define COPYCHAR(d,s) memcpy(d, s, pg_mblen(s))
extern int t_isalpha(const char *ptr);
-extern int t_isalnum(const char *ptr);
+extern int t_isalnum_collation(const char *ptr, Oid collation);
+
+static inline int
+t_isalnum(const char *ptr)
+{
+ return t_isalnum_collation(ptr, DEFAULT_COLLATION_OID);
+}
extern bool tsearch_readline_begin(tsearch_readline_state *stp,
const char *filename);
--
2.51.0
[text/x-patch] v4-0001-Use-correct-collation-for-lowercasing.patch (19.5K, 3-v4-0001-Use-correct-collation-for-lowercasing.patch)
download | inline diff:
From 3c443474bd1bae9295482ddcce7039b9c0a2637b Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Wed, 21 Jan 2026 14:54:28 +0100
Subject: [PATCH v4 1/2] Use correct collation for lowercasing
pg_trgm converts the input words to lowercase before extracting the
trigrams. The lowercase conversion depends on the collation. Previously,
pg_trgm always used the default collation. Now, the specified collation
is used instead.
---
contrib/pg_trgm/Makefile | 2 +-
.../pg_trgm/expected/pg_trgm_collation.out | 43 +++++++++++++
.../pg_trgm/expected/pg_trgm_collation_1.out | 10 ++++
contrib/pg_trgm/meson.build | 1 +
contrib/pg_trgm/sql/pg_trgm_collation.sql | 24 ++++++++
contrib/pg_trgm/trgm.h | 4 +-
contrib/pg_trgm/trgm_gin.c | 7 ++-
contrib/pg_trgm/trgm_gist.c | 10 ++--
contrib/pg_trgm/trgm_op.c | 60 ++++++++++---------
contrib/pg_trgm/trgm_regexp.c | 20 +++----
10 files changed, 133 insertions(+), 48 deletions(-)
create mode 100644 contrib/pg_trgm/expected/pg_trgm_collation.out
create mode 100644 contrib/pg_trgm/expected/pg_trgm_collation_1.out
create mode 100644 contrib/pg_trgm/sql/pg_trgm_collation.sql
diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index 1fbdc9ec1ef..1450f2aaea7 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -14,7 +14,7 @@ DATA = pg_trgm--1.5--1.6.sql pg_trgm--1.4--1.5.sql pg_trgm--1.3--1.4.sql \
pg_trgm--1.0--1.1.sql
PGFILEDESC = "pg_trgm - trigram matching"
-REGRESS = pg_trgm pg_word_trgm pg_strict_word_trgm
+REGRESS = pg_trgm pg_word_trgm pg_strict_word_trgm pg_trgm_collation
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
new file mode 100644
index 00000000000..472ce867665
--- /dev/null
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -0,0 +1,43 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- Test that lowercase conversion of trigrams uses specified collation
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+ show_trgm
+---------------------------------------------
+ {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+ similarity
+------------
+ 1
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+ similarity
+------------
+ 0.54545456
+(1 row)
+
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation_1.out b/contrib/pg_trgm/expected/pg_trgm_collation_1.out
new file mode 100644
index 00000000000..9a737c16454
--- /dev/null
+++ b/contrib/pg_trgm/expected/pg_trgm_collation_1.out
@@ -0,0 +1,10 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
diff --git a/contrib/pg_trgm/meson.build b/contrib/pg_trgm/meson.build
index 3cc299d5eaa..95d2806dda6 100644
--- a/contrib/pg_trgm/meson.build
+++ b/contrib/pg_trgm/meson.build
@@ -41,6 +41,7 @@ tests += {
'pg_trgm',
'pg_word_trgm',
'pg_strict_word_trgm',
+ 'pg_trgm_collation',
],
},
}
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
new file mode 100644
index 00000000000..afb3973a8b3
--- /dev/null
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -0,0 +1,24 @@
+/*
+ * This test is for ICU collations.
+ */
+
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- Test that lowercase conversion of trigrams uses specified collation
+
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index ca017585369..147eefaa3c6 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -119,8 +119,8 @@ extern double strict_word_similarity_threshold;
extern double index_strategy_get_limit(StrategyNumber strategy);
extern uint32 trgm2int(trgm *ptr);
extern void compact_trigram(trgm *tptr, char *str, int bytelen);
-extern TRGM *generate_trgm(char *str, int slen);
-extern TRGM *generate_wildcard_trgm(const char *str, int slen);
+extern TRGM *generate_trgm(char *str, int slen, Oid collation);
+extern TRGM *generate_wildcard_trgm(const char *str, int slen, Oid collation);
extern float4 cnt_sml(TRGM *trg1, TRGM *trg2, bool inexact);
extern bool trgm_contained_by(TRGM *trg1, TRGM *trg2);
extern bool *trgm_presence_map(TRGM *query, TRGM *key);
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
index 014bb3c848c..125e71ca23a 100644
--- a/contrib/pg_trgm/trgm_gin.c
+++ b/contrib/pg_trgm/trgm_gin.c
@@ -42,7 +42,7 @@ gin_extract_value_trgm(PG_FUNCTION_ARGS)
*nentries = 0;
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
trglen = ARRNELEM(trg);
if (trglen > 0)
@@ -93,7 +93,7 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
case WordSimilarityStrategyNumber:
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -107,7 +107,8 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
* potentially-matching string must include.
*/
trg = generate_wildcard_trgm(VARDATA_ANY(val),
- VARSIZE_ANY_EXHDR(val));
+ VARSIZE_ANY_EXHDR(val),
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 2f0d61985a5..22cc2a6f7b1 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -123,7 +123,7 @@ gtrgm_compress(PG_FUNCTION_ARGS)
TRGM *res;
text *val = DatumGetTextPP(entry->key);
- res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
retval = palloc_object(GISTENTRY);
gistentryinit(*retval, PointerGetDatum(res),
entry->rel, entry->page,
@@ -242,7 +242,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
qtrg = generate_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -251,7 +252,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
/* FALL THRU */
case LikeStrategyNumber:
qtrg = generate_wildcard_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
@@ -475,7 +477,7 @@ gtrgm_distance(PG_FUNCTION_ARGS)
{
char *newcache;
- qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ);
+ qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ, PG_GET_COLLATION());
newcache = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
MAXALIGN(querysize) +
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index 81182a15e07..d4c75caeff9 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -324,7 +324,7 @@ make_trigrams(trgm *tptr, char *str, int bytelen, int charlen)
* Returns length of the generated array.
*/
static int
-generate_trgm_only(trgm *trg, char *str, int slen, TrgmBound *bounds)
+generate_trgm_only(trgm *trg, char *str, int slen, Oid collation, TrgmBound *bounds)
{
trgm *tptr;
char *buf;
@@ -352,7 +352,7 @@ generate_trgm_only(trgm *trg, char *str, int slen, TrgmBound *bounds)
while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
{
#ifdef IGNORECASE
- bword = str_tolower(bword, eword - bword, DEFAULT_COLLATION_OID);
+ bword = str_tolower(bword, eword - bword, collation);
bytelen = strlen(bword);
#else
bytelen = eword - bword;
@@ -405,7 +405,7 @@ protect_out_of_mem(int slen)
* Returns the sorted array of unique trigrams.
*/
TRGM *
-generate_trgm(char *str, int slen)
+generate_trgm(char *str, int slen, Oid collation)
{
TRGM *trg;
int len;
@@ -415,7 +415,7 @@ generate_trgm(char *str, int slen)
trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) * 3);
trg->flag = ARRKEY;
- len = generate_trgm_only(GETARR(trg), str, slen, NULL);
+ len = generate_trgm_only(GETARR(trg), str, slen, collation, NULL);
SET_VARSIZE(trg, CALCGTSIZE(ARRKEY, len));
if (len == 0)
@@ -671,7 +671,7 @@ iterate_word_similarity(int *trg2indexes,
*/
static float4
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
- uint8 flags)
+ uint8 flags, Oid collation)
{
bool *found;
pos_trgm *ptrg;
@@ -697,8 +697,8 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
else
bounds = NULL;
- len1 = generate_trgm_only(trg1, str1, slen1, NULL);
- len2 = generate_trgm_only(trg2, str2, slen2, bounds);
+ len1 = generate_trgm_only(trg1, str1, slen1, collation, NULL);
+ len2 = generate_trgm_only(trg2, str2, slen2, collation, bounds);
ptrg = make_positional_trgm(trg1, len1, trg2, len2);
len = len1 + len2;
@@ -915,7 +915,7 @@ get_wildcard_part(const char *str, int lenstr,
* " a", "bcd" would be extracted.
*/
TRGM *
-generate_wildcard_trgm(const char *str, int slen)
+generate_wildcard_trgm(const char *str, int slen, Oid collation)
{
TRGM *trg;
char *buf,
@@ -948,7 +948,7 @@ generate_wildcard_trgm(const char *str, int slen)
buf, &bytelen, &charlen)) != NULL)
{
#ifdef IGNORECASE
- buf2 = str_tolower(buf, bytelen, DEFAULT_COLLATION_OID);
+ buf2 = str_tolower(buf, bytelen, collation);
bytelen = strlen(buf2);
#else
buf2 = buf;
@@ -1007,7 +1007,7 @@ show_trgm(PG_FUNCTION_ARGS)
trgm *ptr;
int i;
- trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));
+ trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in), PG_GET_COLLATION());
d = palloc_array(Datum, 1 + ARRNELEM(trg));
for (i = 0, ptr = GETARR(trg); i < ARRNELEM(trg); i++, ptr++)
@@ -1174,8 +1174,8 @@ similarity(PG_FUNCTION_ARGS)
*trg2;
float4 res;
- trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1));
- trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2));
+ trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1), PG_GET_COLLATION());
+ trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2), PG_GET_COLLATION());
res = cnt_sml(trg1, trg2, false);
@@ -1196,7 +1196,7 @@ word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1212,7 +1212,7 @@ strict_word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1222,9 +1222,10 @@ strict_word_similarity(PG_FUNCTION_ARGS)
Datum
similarity_dist(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_FLOAT4(1.0 - res);
}
@@ -1232,9 +1233,10 @@ similarity_dist(PG_FUNCTION_ARGS)
Datum
similarity_op(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_BOOL(res >= similarity_threshold);
}
@@ -1248,7 +1250,7 @@ word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1264,7 +1266,7 @@ word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1280,7 +1282,7 @@ word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1296,7 +1298,7 @@ word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1312,7 +1314,8 @@ strict_word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1328,7 +1331,8 @@ strict_word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1344,7 +1348,7 @@ strict_word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1360,7 +1364,7 @@ strict_word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1d1b5fe304d..cdd04fa01ad 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -479,11 +479,11 @@ typedef struct
/* prototypes for private functions */
static TRGM *createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext);
+ MemoryContext rcontext, Oid collation);
static void RE_compile(regex_t *regex, text *text_re,
int cflags, Oid collation);
-static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA);
-static bool convertPgWchar(pg_wchar c, trgm_mb_char *result);
+static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation);
+static bool convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation);
static void transformGraph(TrgmNFA *trgmNFA);
static void processState(TrgmNFA *trgmNFA, TrgmState *state);
static void addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key);
@@ -551,7 +551,7 @@ createTrgmNFA(text *text_re, Oid collation,
REG_ADVANCED | REG_NOSUB, collation);
#endif
- trg = createTrgmNFAInternal(®ex, graph, rcontext);
+ trg = createTrgmNFAInternal(®ex, graph, rcontext, collation);
/* Clean up all the cruft we created (including regex) */
MemoryContextSwitchTo(oldcontext);
@@ -565,7 +565,7 @@ createTrgmNFA(text *text_re, Oid collation,
*/
static TRGM *
createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext)
+ MemoryContext rcontext, Oid collation)
{
TRGM *trg;
TrgmNFA trgmNFA;
@@ -573,7 +573,7 @@ createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
trgmNFA.regex = regex;
/* Collect color information from the regex */
- getColorInfo(regex, &trgmNFA);
+ getColorInfo(regex, &trgmNFA, collation);
#ifdef TRGM_REGEXP_DEBUG
printSourceNFA(regex, trgmNFA.colorInfo, trgmNFA.ncolors);
@@ -762,7 +762,7 @@ RE_compile(regex_t *regex, text *text_re, int cflags, Oid collation)
* Fill TrgmColorInfo structure for each color using regex export functions.
*/
static void
-getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
+getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
{
int colorsCount = pg_reg_getnumcolors(regex);
int i;
@@ -808,7 +808,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
{
trgm_mb_char c;
- if (!convertPgWchar(chars[j], &c))
+ if (!convertPgWchar(chars[j], &c, collation))
continue; /* ok to ignore it altogether */
if (ISWORDCHR(c.bytes))
colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
@@ -825,7 +825,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
* Returns false if the character should be ignored completely.
*/
static bool
-convertPgWchar(pg_wchar c, trgm_mb_char *result)
+convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation)
{
/* "s" has enough space for a multibyte character and a trailing NUL */
char s[MAX_MULTIBYTE_CHAR_LEN + 1];
@@ -857,7 +857,7 @@ convertPgWchar(pg_wchar c, trgm_mb_char *result)
*/
#ifdef IGNORECASE
{
- char *lowerCased = str_tolower(s, strlen(s), DEFAULT_COLLATION_OID);
+ char *lowerCased = str_tolower(s, strlen(s), collation);
if (strcmp(lowerCased, s) != 0)
{
--
2.51.0
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
2026-01-26 13:33 ` Re: Use correct collation in pg_trgm Alexander Korotkov <[email protected]>
2026-01-30 08:42 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
2026-02-02 20:33 ` Re: Use correct collation in pg_trgm Zsolt Parragi <[email protected]>
2026-02-03 11:03 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
@ 2026-03-02 13:47 ` David Geier <[email protected]>
0 siblings, 0 replies; 7+ messages in thread
From: David Geier @ 2026-03-02 13:47 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Alexander Korotkov <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Kirill Reshke <[email protected]>
> I've rebased the patches on latest master.
Attached is v6 of the patch which fixes a problem with the tests:
pg_trgm_collation_1.out mistakenly contained \endif at the end of the file.
--
David Geier
Attachments:
[text/x-patch] v6-0002-Use-correct-collation-for-finding-word-boundaries.patch (9.7K, 2-v6-0002-Use-correct-collation-for-finding-word-boundaries.patch)
download | inline diff:
From 916e50174bf2df7fbe138803fbbd8f9b435dd2d8 Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Fri, 23 Jan 2026 15:39:06 +0100
Subject: [PATCH v6 2/2] Use correct collation for finding word boundaries
pg_trgm finds all words in the input string and creates trigrams for
them. Word characters are alpha-numeric characters. What qualifies as
alpha-numeric character depends on the collation. Previously, pg_trgm
always used the default collation. Now the specified collation is used
instead.
---
.../pg_trgm/expected/pg_trgm_collation.out | 13 ++++++++++++
contrib/pg_trgm/sql/pg_trgm_collation.sql | 5 +++++
contrib/pg_trgm/trgm.h | 6 +++---
contrib/pg_trgm/trgm_op.c | 20 +++++++++----------
contrib/pg_trgm/trgm_regexp.c | 2 +-
src/backend/tsearch/ts_locale.c | 8 ++++----
src/include/tsearch/ts_locale.h | 3 ++-
7 files changed, 38 insertions(+), 19 deletions(-)
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
index 472ce867665..0cc53edd821 100644
--- a/contrib/pg_trgm/expected/pg_trgm_collation.out
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -41,3 +41,16 @@ SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
0.54545456
(1 row)
+-- Test that word boundary identification uses specified collation
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------------------------
+ {0x8fc0a2,0x93dfbf,0x1bf43c," h"," he",0x22d44f,0x4398ff,cod,"de ",dic,ell,est,hel,ico,ldi,llo,ode,orl,0x71b8f5,rld,tes,wor}
+(1 row)
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
+ show_trgm
+-------------------------------------------------------------------------------------------------------------
+ {" c"," h"," t"," co"," he"," te",cod,"de ",ell,est,hel,iwo,"ld ",llo,"lo ",ode,orl,rld,sti,tes,tiw,wor}
+(1 row)
+
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
index afb3973a8b3..e1a5c7c5fa8 100644
--- a/contrib/pg_trgm/sql/pg_trgm_collation.sql
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -22,3 +22,8 @@ SELECT show_trgm('ISTANBUL' COLLATE "C");
SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+-- Test that word boundary identification uses specified collation
+
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "tr-x-icu");
+SELECT show_trgm('helloıtestIworldİcode' COLLATE "C");
+
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index b6911e91458..3c4db129e20 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -47,9 +47,9 @@ typedef char trgm[3];
} while(0)
extern int (*CMPTRGM) (const void *a, const void *b);
-#define ISWORDCHR(c, len) (t_isalnum_with_len(c, len))
-#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
-#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
+#define ISWORDCHR(c, len, collation) (t_isalnum_with_len_collation(c, len, collation))
+#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
+#define ISPRINTABLETRGM(t) ( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
#define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
#define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index 5f2defb57f2..ca30cb7c363 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -292,7 +292,7 @@ comp_trgm(const void *a, const void *b)
* endword points to the character after word
*/
static char *
-find_word(char *str, int lenstr, char **endword)
+find_word(char *str, int lenstr, char **endword, Oid collation)
{
char *beginword = str;
const char *endstr = str + lenstr;
@@ -301,7 +301,7 @@ find_word(char *str, int lenstr, char **endword)
{
int clen = pg_mblen_range(beginword, endstr);
- if (ISWORDCHR(beginword, clen))
+ if (ISWORDCHR(beginword, clen, collation))
break;
beginword += clen;
}
@@ -314,7 +314,7 @@ find_word(char *str, int lenstr, char **endword)
{
int clen = pg_mblen_range(*endword, endstr);
- if (!ISWORDCHR(*endword, clen))
+ if (!ISWORDCHR(*endword, clen, collation))
break;
*endword += clen;
}
@@ -490,7 +490,7 @@ generate_trgm_only(growable_trgm_array *dst, char *str, int slen, Oid collation,
}
eword = str;
- while ((bword = find_word(eword, slen - (eword - str), &eword)) != NULL)
+ while ((bword = find_word(eword, slen - (eword - str), &eword, collation)) != NULL)
{
int oldlen;
@@ -907,7 +907,7 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
*/
static const char *
get_wildcard_part(const char *str, int lenstr,
- char *buf, int *bytelen)
+ char *buf, int *bytelen, Oid collation)
{
const char *beginword = str;
const char *endword;
@@ -930,7 +930,7 @@ get_wildcard_part(const char *str, int lenstr,
if (in_escape)
{
- if (ISWORDCHR(beginword, clen))
+ if (ISWORDCHR(beginword, clen, collation))
break;
in_escape = false;
in_leading_wildcard_meta = false;
@@ -941,7 +941,7 @@ get_wildcard_part(const char *str, int lenstr,
in_escape = true;
else if (ISWILDCARDCHAR(beginword))
in_leading_wildcard_meta = true;
- else if (ISWORDCHR(beginword, clen))
+ else if (ISWORDCHR(beginword, clen, collation))
break;
else
in_leading_wildcard_meta = false;
@@ -979,7 +979,7 @@ get_wildcard_part(const char *str, int lenstr,
clen = pg_mblen_range(endword, endstr);
if (in_escape)
{
- if (ISWORDCHR(endword, clen))
+ if (ISWORDCHR(endword, clen, collation))
{
memcpy(s, endword, clen);
s += clen;
@@ -1006,7 +1006,7 @@ get_wildcard_part(const char *str, int lenstr,
in_trailing_wildcard_meta = true;
break;
}
- else if (ISWORDCHR(endword, clen))
+ else if (ISWORDCHR(endword, clen, collation))
{
memcpy(s, endword, clen);
s += clen;
@@ -1070,7 +1070,7 @@ generate_wildcard_trgm(const char *str, int slen, Oid collation)
*/
eword = str;
while ((eword = get_wildcard_part(eword, slen - (eword - str),
- buf, &bytelen)) != NULL)
+ buf, &bytelen, collation)) != NULL)
{
char *word;
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 2f190df2f65..57f7b12c3d9 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -811,7 +811,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
if (!clen)
continue; /* ok to ignore it altogether */
- if (ISWORDCHR(c.bytes, clen))
+ if (ISWORDCHR(c.bytes, clen, collation))
colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
else
colorInfo->containsNonWord = true;
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index df02ffb12fd..6f331e054a2 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -26,27 +26,27 @@ static void tsearch_readline_callback(void *arg);
#define GENERATE_T_ISCLASS_DEF(character_class) \
/* mblen shall be that of the first character */ \
int \
-t_is##character_class##_with_len(const char *ptr, int mblen) \
+t_is##character_class##_with_len_collation(const char *ptr, int mblen, Oid collation) \
{ \
pg_wchar wstr[WC_BUF_LEN]; \
int wlen pg_attribute_unused(); \
wlen = pg_mb2wchar_with_len(ptr, wstr, mblen); \
Assert(wlen <= 1); \
/* pass single character, or NUL if empty */ \
- return pg_isw##character_class(wstr[0], pg_database_locale()); \
+ return pg_isw##character_class(wstr[0], pg_newlocale_from_collation(collation)); \
} \
\
/* ptr shall point to a NUL-terminated string */ \
int \
t_is##character_class##_cstr(const char *ptr) \
{ \
- return t_is##character_class##_with_len(ptr, pg_mblen_cstr(ptr)); \
+ return t_is##character_class##_with_len_collation(ptr, pg_mblen_cstr(ptr), DEFAULT_COLLATION_OID); \
} \
/* ptr shall point to a string with pre-validated encoding */ \
int \
t_is##character_class##_unbounded(const char *ptr) \
{ \
- return t_is##character_class##_with_len(ptr, pg_mblen_unbounded(ptr)); \
+ return t_is##character_class##_with_len_collation(ptr, pg_mblen_unbounded(ptr), DEFAULT_COLLATION_OID); \
} \
/* historical name for _unbounded */ \
int \
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 6e2d67ee4a5..7ad7042d523 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -18,6 +18,7 @@
#include "lib/stringinfo.h"
#include "mb/pg_wchar.h"
+#include "catalog/pg_collation.h"
#include "utils/pg_locale.h"
/* working state for tsearch_readline (should be a local var in caller) */
@@ -56,7 +57,7 @@ ts_copychar_cstr(void *dest, const void *src)
#define COPYCHAR ts_copychar_cstr
#define GENERATE_T_ISCLASS_DECL(character_class) \
-extern int t_is##character_class##_with_len(const char *ptr, int len); \
+extern int t_is##character_class##_with_len_collation(const char *ptr, int len, Oid collation); \
extern int t_is##character_class##_cstr(const char *ptr); \
extern int t_is##character_class##_unbounded(const char *ptr); \
\
--
2.51.0
[text/x-patch] v6-0001-Use-correct-collation-for-lowercasing.patch (19.4K, 3-v6-0001-Use-correct-collation-for-lowercasing.patch)
download | inline diff:
From 3bdef59a6f4ec89b7855dbff1b560fb8753a3130 Mon Sep 17 00:00:00 2001
From: David Geier <[email protected]>
Date: Wed, 21 Jan 2026 14:54:28 +0100
Subject: [PATCH v6 1/2] Use correct collation for lowercasing
pg_trgm converts the input words to lowercase before extracting the
trigrams. The lowercase conversion depends on the collation. Previously,
pg_trgm always used the default collation. Now, the specified collation
is used instead.
---
contrib/pg_trgm/Makefile | 2 +-
.../pg_trgm/expected/pg_trgm_collation.out | 43 +++++++++++++
.../pg_trgm/expected/pg_trgm_collation_1.out | 9 +++
contrib/pg_trgm/meson.build | 1 +
contrib/pg_trgm/sql/pg_trgm_collation.sql | 24 ++++++++
contrib/pg_trgm/trgm.h | 4 +-
contrib/pg_trgm/trgm_gin.c | 7 ++-
contrib/pg_trgm/trgm_gist.c | 10 ++--
contrib/pg_trgm/trgm_op.c | 60 ++++++++++---------
contrib/pg_trgm/trgm_regexp.c | 20 +++----
10 files changed, 132 insertions(+), 48 deletions(-)
create mode 100644 contrib/pg_trgm/expected/pg_trgm_collation.out
create mode 100644 contrib/pg_trgm/expected/pg_trgm_collation_1.out
create mode 100644 contrib/pg_trgm/sql/pg_trgm_collation.sql
diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index c1756993ec7..26b3028b75e 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -14,7 +14,7 @@ DATA = pg_trgm--1.5--1.6.sql pg_trgm--1.4--1.5.sql pg_trgm--1.3--1.4.sql \
pg_trgm--1.0--1.1.sql
PGFILEDESC = "pg_trgm - trigram matching"
-REGRESS = pg_trgm pg_utf8_trgm pg_word_trgm pg_strict_word_trgm
+REGRESS = pg_trgm pg_utf8_trgm pg_word_trgm pg_strict_word_trgm pg_trgm_collation
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation.out b/contrib/pg_trgm/expected/pg_trgm_collation.out
new file mode 100644
index 00000000000..472ce867665
--- /dev/null
+++ b/contrib/pg_trgm/expected/pg_trgm_collation.out
@@ -0,0 +1,43 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- Test that lowercase conversion of trigrams uses specified collation
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+ show_trgm
+--------------------------------------------------------
+ {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+ show_trgm
+---------------------------------------------
+ {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+ similarity
+------------
+ 1
+(1 row)
+
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+ similarity
+------------
+ 0.54545456
+(1 row)
+
diff --git a/contrib/pg_trgm/expected/pg_trgm_collation_1.out b/contrib/pg_trgm/expected/pg_trgm_collation_1.out
new file mode 100644
index 00000000000..25c99c4abf0
--- /dev/null
+++ b/contrib/pg_trgm/expected/pg_trgm_collation_1.out
@@ -0,0 +1,9 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_trgm/meson.build b/contrib/pg_trgm/meson.build
index 3ecf95ba862..5eafa774435 100644
--- a/contrib/pg_trgm/meson.build
+++ b/contrib/pg_trgm/meson.build
@@ -42,6 +42,7 @@ tests += {
'pg_utf8_trgm',
'pg_word_trgm',
'pg_strict_word_trgm',
+ 'pg_trgm_collation',
],
},
}
diff --git a/contrib/pg_trgm/sql/pg_trgm_collation.sql b/contrib/pg_trgm/sql/pg_trgm_collation.sql
new file mode 100644
index 00000000000..afb3973a8b3
--- /dev/null
+++ b/contrib/pg_trgm/sql/pg_trgm_collation.sql
@@ -0,0 +1,24 @@
+/*
+ * This test is for ICU collations.
+ */
+
+/* skip test if not UTF8 server encoding or no ICU collations installed */
+SELECT getdatabaseencoding() <> 'UTF8' OR
+ (SELECT count(*) FROM pg_collation WHERE collprovider = 'i' AND collname <> 'unicode') = 0
+ AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- Test that lowercase conversion of trigrams uses specified collation
+
+CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
+INSERT INTO test VALUES ('ISTANBUL');
+SELECT show_trgm(col) FROM test;
+SELECT show_trgm('ISTANBUL' COLLATE "tr-x-icu");
+
+SELECT show_trgm('ISTANBUL' COLLATE "C");
+
+SELECT similarity('ıstanbul' COLLATE "tr-x-icu", 'ISTANBUL' COLLATE "tr-x-icu");
+SELECT similarity('ıstanbul' COLLATE "C", 'ISTANBUL' COLLATE "C");
+
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index ca23aad4dd9..b6911e91458 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -119,8 +119,8 @@ extern double strict_word_similarity_threshold;
extern double index_strategy_get_limit(StrategyNumber strategy);
extern uint32 trgm2int(trgm *ptr);
extern void compact_trigram(trgm *tptr, char *str, int bytelen);
-extern TRGM *generate_trgm(char *str, int slen);
-extern TRGM *generate_wildcard_trgm(const char *str, int slen);
+extern TRGM *generate_trgm(char *str, int slen, Oid collation);
+extern TRGM *generate_wildcard_trgm(const char *str, int slen, Oid collation);
extern float4 cnt_sml(TRGM *trg1, TRGM *trg2, bool inexact);
extern bool trgm_contained_by(TRGM *trg1, TRGM *trg2);
extern bool *trgm_presence_map(TRGM *query, TRGM *key);
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
index 5766b3e9955..14a892c657d 100644
--- a/contrib/pg_trgm/trgm_gin.c
+++ b/contrib/pg_trgm/trgm_gin.c
@@ -42,7 +42,7 @@ gin_extract_value_trgm(PG_FUNCTION_ARGS)
*nentries = 0;
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
trglen = ARRNELEM(trg);
if (trglen > 0)
@@ -93,7 +93,7 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
case WordSimilarityStrategyNumber:
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
- trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ trg = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -107,7 +107,8 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
* potentially-matching string must include.
*/
trg = generate_wildcard_trgm(VARDATA_ANY(val),
- VARSIZE_ANY_EXHDR(val));
+ VARSIZE_ANY_EXHDR(val),
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 11812b2984e..d9102400442 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -123,7 +123,7 @@ gtrgm_compress(PG_FUNCTION_ARGS)
TRGM *res;
text *val = DatumGetTextPP(entry->key);
- res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val));
+ res = generate_trgm(VARDATA_ANY(val), VARSIZE_ANY_EXHDR(val), PG_GET_COLLATION());
retval = palloc_object(GISTENTRY);
gistentryinit(*retval, PointerGetDatum(res),
entry->rel, entry->page,
@@ -242,7 +242,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
case StrictWordSimilarityStrategyNumber:
case EqualStrategyNumber:
qtrg = generate_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case ILikeStrategyNumber:
#ifndef IGNORECASE
@@ -251,7 +252,8 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
pg_fallthrough;
case LikeStrategyNumber:
qtrg = generate_wildcard_trgm(VARDATA(query),
- querysize - VARHDRSZ);
+ querysize - VARHDRSZ,
+ PG_GET_COLLATION());
break;
case RegExpICaseStrategyNumber:
#ifndef IGNORECASE
@@ -475,7 +477,7 @@ gtrgm_distance(PG_FUNCTION_ARGS)
{
char *newcache;
- qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ);
+ qtrg = generate_trgm(VARDATA(query), querysize - VARHDRSZ, PG_GET_COLLATION());
newcache = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
MAXALIGN(querysize) +
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index ee89e548d16..5f2defb57f2 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -447,7 +447,7 @@ done:
* bounds_p: where to return bounds of trigrams (if needed).
*/
static void
-generate_trgm_only(growable_trgm_array *dst, char *str, int slen, TrgmBound **bounds_p)
+generate_trgm_only(growable_trgm_array *dst, char *str, int slen, Oid collation, TrgmBound **bounds_p)
{
size_t buflen;
char *buf;
@@ -499,7 +499,7 @@ generate_trgm_only(growable_trgm_array *dst, char *str, int slen, TrgmBound **bo
{
char *lowered;
- lowered = str_tolower(bword, eword - bword, DEFAULT_COLLATION_OID);
+ lowered = str_tolower(bword, eword - bword, collation);
bytelen = strlen(lowered);
/* grow the buffer if necessary */
@@ -553,13 +553,13 @@ generate_trgm_only(growable_trgm_array *dst, char *str, int slen, TrgmBound **bo
* Returns the sorted array of unique trigrams.
*/
TRGM *
-generate_trgm(char *str, int slen)
+generate_trgm(char *str, int slen, Oid collation)
{
TRGM *trg;
growable_trgm_array arr;
int len;
- generate_trgm_only(&arr, str, slen, NULL);
+ generate_trgm_only(&arr, str, slen, collation, NULL);
len = arr.length;
trg = arr.datum;
trg->flag = ARRKEY;
@@ -814,7 +814,7 @@ iterate_word_similarity(int *trg2indexes,
*/
static float4
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
- uint8 flags)
+ uint8 flags, Oid collation)
{
bool *found;
pos_trgm *ptrg;
@@ -832,9 +832,9 @@ calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
/* Make positional trigrams */
- generate_trgm_only(&trg1, str1, slen1, NULL);
+ generate_trgm_only(&trg1, str1, slen1, collation, NULL);
len1 = trg1.length;
- generate_trgm_only(&trg2, str2, slen2, (flags & WORD_SIMILARITY_STRICT) ? &bounds : NULL);
+ generate_trgm_only(&trg2, str2, slen2, collation, (flags & WORD_SIMILARITY_STRICT) ? &bounds : NULL);
len2 = trg2.length;
ptrg = make_positional_trgm(GETARR(trg1.datum), len1, GETARR(trg2.datum), len2);
@@ -1043,7 +1043,7 @@ get_wildcard_part(const char *str, int lenstr,
* " a", "bcd" would be extracted.
*/
TRGM *
-generate_wildcard_trgm(const char *str, int slen)
+generate_wildcard_trgm(const char *str, int slen, Oid collation)
{
TRGM *trg;
growable_trgm_array arr;
@@ -1075,7 +1075,7 @@ generate_wildcard_trgm(const char *str, int slen)
char *word;
#ifdef IGNORECASE
- word = str_tolower(buf, bytelen, DEFAULT_COLLATION_OID);
+ word = str_tolower(buf, bytelen, collation);
bytelen = strlen(word);
#else
word = buf;
@@ -1134,7 +1134,7 @@ show_trgm(PG_FUNCTION_ARGS)
trgm *ptr;
int i;
- trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));
+ trg = generate_trgm(VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in), PG_GET_COLLATION());
d = palloc_array(Datum, 1 + ARRNELEM(trg));
for (i = 0, ptr = GETARR(trg); i < ARRNELEM(trg); i++, ptr++)
@@ -1301,8 +1301,8 @@ similarity(PG_FUNCTION_ARGS)
*trg2;
float4 res;
- trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1));
- trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2));
+ trg1 = generate_trgm(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1), PG_GET_COLLATION());
+ trg2 = generate_trgm(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2), PG_GET_COLLATION());
res = cnt_sml(trg1, trg2, false);
@@ -1323,7 +1323,7 @@ word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1339,7 +1339,7 @@ strict_word_similarity(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1349,9 +1349,10 @@ strict_word_similarity(PG_FUNCTION_ARGS)
Datum
similarity_dist(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_FLOAT4(1.0 - res);
}
@@ -1359,9 +1360,10 @@ similarity_dist(PG_FUNCTION_ARGS)
Datum
similarity_op(PG_FUNCTION_ARGS)
{
- float4 res = DatumGetFloat4(DirectFunctionCall2(similarity,
- PG_GETARG_DATUM(0),
- PG_GETARG_DATUM(1)));
+ float4 res = DatumGetFloat4(DirectFunctionCall2Coll(similarity,
+ PG_GET_COLLATION(),
+ PG_GETARG_DATUM(0),
+ PG_GETARG_DATUM(1)));
PG_RETURN_BOOL(res >= similarity_threshold);
}
@@ -1375,7 +1377,7 @@ word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1391,7 +1393,7 @@ word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY);
+ WORD_SIMILARITY_CHECK_ONLY, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1407,7 +1409,7 @@ word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1423,7 +1425,7 @@ word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- 0);
+ 0, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1439,7 +1441,8 @@ strict_word_similarity_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1455,7 +1458,8 @@ strict_word_similarity_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_CHECK_ONLY | WORD_SIMILARITY_STRICT,
+ PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1471,7 +1475,7 @@ strict_word_similarity_dist_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
@@ -1487,7 +1491,7 @@ strict_word_similarity_dist_commutator_op(PG_FUNCTION_ARGS)
res = calc_word_similarity(VARDATA_ANY(in2), VARSIZE_ANY_EXHDR(in2),
VARDATA_ANY(in1), VARSIZE_ANY_EXHDR(in1),
- WORD_SIMILARITY_STRICT);
+ WORD_SIMILARITY_STRICT, PG_GET_COLLATION());
PG_FREE_IF_COPY(in1, 0);
PG_FREE_IF_COPY(in2, 1);
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index efee4cf5fb4..2f190df2f65 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -479,11 +479,11 @@ typedef struct
/* prototypes for private functions */
static TRGM *createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext);
+ MemoryContext rcontext, Oid collation);
static void RE_compile(regex_t *regex, text *text_re,
int cflags, Oid collation);
-static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA);
-static int convertPgWchar(pg_wchar c, trgm_mb_char *result);
+static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation);
+static int convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation);
static void transformGraph(TrgmNFA *trgmNFA);
static void processState(TrgmNFA *trgmNFA, TrgmState *state);
static void addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key);
@@ -551,7 +551,7 @@ createTrgmNFA(text *text_re, Oid collation,
REG_ADVANCED | REG_NOSUB, collation);
#endif
- trg = createTrgmNFAInternal(®ex, graph, rcontext);
+ trg = createTrgmNFAInternal(®ex, graph, rcontext, collation);
/* Clean up all the cruft we created (including regex) */
MemoryContextSwitchTo(oldcontext);
@@ -565,7 +565,7 @@ createTrgmNFA(text *text_re, Oid collation,
*/
static TRGM *
createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
- MemoryContext rcontext)
+ MemoryContext rcontext, Oid collation)
{
TRGM *trg;
TrgmNFA trgmNFA;
@@ -573,7 +573,7 @@ createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
trgmNFA.regex = regex;
/* Collect color information from the regex */
- getColorInfo(regex, &trgmNFA);
+ getColorInfo(regex, &trgmNFA, collation);
#ifdef TRGM_REGEXP_DEBUG
printSourceNFA(regex, trgmNFA.colorInfo, trgmNFA.ncolors);
@@ -762,7 +762,7 @@ RE_compile(regex_t *regex, text *text_re, int cflags, Oid collation)
* Fill TrgmColorInfo structure for each color using regex export functions.
*/
static void
-getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
+getColorInfo(regex_t *regex, TrgmNFA *trgmNFA, Oid collation)
{
int colorsCount = pg_reg_getnumcolors(regex);
int i;
@@ -807,7 +807,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
for (j = 0; j < charsCount; j++)
{
trgm_mb_char c;
- int clen = convertPgWchar(chars[j], &c);
+ int clen = convertPgWchar(chars[j], &c, collation);
if (!clen)
continue; /* ok to ignore it altogether */
@@ -827,7 +827,7 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
* byte length.
*/
static int
-convertPgWchar(pg_wchar c, trgm_mb_char *result)
+convertPgWchar(pg_wchar c, trgm_mb_char *result, Oid collation)
{
/* "s" has enough space for a multibyte character and a trailing NUL */
char s[MAX_MULTIBYTE_CHAR_LEN + 1];
@@ -860,7 +860,7 @@ convertPgWchar(pg_wchar c, trgm_mb_char *result)
*/
#ifdef IGNORECASE
{
- char *lowerCased = str_tolower(s, clen, DEFAULT_COLLATION_OID);
+ char *lowerCased = str_tolower(s, clen, collation);
if (strcmp(lowerCased, s) != 0)
{
--
2.51.0
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
@ 2026-03-26 08:50 ` David Geier <[email protected]>
2026-03-27 10:56 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
1 sibling, 1 reply; 7+ messages in thread
From: David Geier @ 2026-03-26 08:50 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>
Hi Jeff!
> This area is a bit awkward conceptually. The case you found is not
> about the *sort order* of the values; it's about the casing semantics.
> We mix those two concepts into a single "collation oid" that determines
> both sort order and casing semantics (and pattern matching semantics,
> too).
>
> LOWER() and UPPER() take the casing semantics from the inferred
> collation, so that's a good argument that you're doing the right thing
> here.
>
> But full text search does not; it uses DEFAULT_COLLATION_OID for
> parsing the input. That sort of makes sense, because tsvector/tsquery
> don't have a collatable sort order -- it's more about the parsing
> semantics to create the values in the first place, not about how the
> tsvector/tsquery values are sorted.
For pg_trgm it's also not only about casing but also about parsing: the
decision of what is considered an alpha-numeric character in ISWORDCHR()
depends on the collation.
> So that leaves me wondering: why would pg_trgm use the inferred
> collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
> conceptually similar, and the only real difference I see is that
> tsvector/tsquery are types and pg_trgm is a set of functions.
I agree. That is inconsistent. But if anything, shouldn't we change
tsvector/tsquery to as well adhere to the inferred collation?
For example, when a user specifies a collation for some table column, he
expects the collation to not only impact sort order. With say collation
en-US-x-icu, the B-tree lookup will be case-insensitive. Why would a GIN
index suddenly not adhere to the collation? That seems counter-intuitive
and confusing. The same when using tsvector/tsquery.
More generally: shouldn't it, from a user's point-of-view, be an all or
nothing to avoid surprises?
If not, we should come up with easy to understand and easy to remember
reasons for what adheres to the inferred collation and what adheres the
default collation and document that.
> Note that I made some changes here recently: full text search and ltree
> used to use libc unconditionally or a mix of libc and
> DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
> consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
> conceptual problem of whether we should use the inferred collation (as
> you suggest) or not.
Thanks for the heads up.
--
David Geier
^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: Use correct collation in pg_trgm
2026-03-26 08:50 ` Re: Use correct collation in pg_trgm David Geier <[email protected]>
@ 2026-03-27 10:56 ` David Geier <[email protected]>
0 siblings, 0 replies; 7+ messages in thread
From: David Geier @ 2026-03-27 10:56 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Jeff Davis <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Peter Eisentraut <[email protected]>
On 26.03.2026 19:26, Tom Lane wrote:
> Jeff Davis <[email protected]> writes:
>> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>>> I agree. That is inconsistent. But if anything, shouldn't we change
>>> tsvector/tsquery to as well adhere to the inferred collation?
>
>> I am not sure either way.
>> It's easy to specify a COLLATE clause to affect the interpretation of
>> the input. But once you parse the inputs into a stored value, you can't
>> later reinterpret those values by specifying a COLLATE clause. The
>> parsing already happened and the original input string was lost.
>> You can end up with a table full of values, some of which were parsed
>> with one set of semantics, and others parsed with a different set of
>> semantics. That may make sense or it may just cause confusion. It's
>> tough for me to say.
>
> The rule that text search goes by is that it's okay to be a bit
> fuzzy about this because people are usually looking for approximate
> matches, so that even if you have sets of lexemes that were extracted
> under slightly different parsing rules you can probably still find
tsquery allows to do starts-with queries equivalent to LIKE 'foo%' via
to_tsquery('foo:*'). These two would then also behave differently.
Can you see any good reason that speaks against using the inferred
collation in tsquery / tsvector?
> what you want. While that argument still works for pg_trgm's original
> "similarity" functions, it falls flat for the LIKE/ILIKE/regex index
> support functionality: people will be justifiably unhappy if the index
> doesn't find the exact same matches that a seqscan-and-filter would.
Agreed. That was also one of the motivations to change it.
> I've not experimented, but I rather imagine that things are already
> buggy as heck, in that optimizing a LIKE or regex expression that's
> got collation A applied to it into an indexscan on a pg_trgm index
> made with collation B will not work if different trigrams get
> extracted. I think we have to insist that the index collation match
> the query. Once we've done that, the concern about making a change
> like this seems less: you will not get wrong answers, rather the
> planner will refuse to use an incompatible index.
I thought that happens already. In the following example no index scan
is used, even though sequential scan is disabled. FWICS,
IndexCollMatchesExprColl() takes care of that.
CREATE EXTENSION pg_trgm;
CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
CREATE INDEX ON test USING GIN(col gin_trgm_ops);
SET enable_seqscan = FALSE;
EXPLAIN SELECT * FROM test WHERE col LIKE '%test%' COLLATE "C";
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000000001.01 rows=1 width=32)
Filter: (col = 'test'::text COLLATE "C")
If you have other cases in mind, pointers are appreciated.
--
David Geier
^ permalink raw reply [nested|flat] 7+ messages in thread
end of thread, other threads:[~2026-03-27 10:56 UTC | newest]
Thread overview: 7+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-26 13:33 ` Alexander Korotkov <[email protected]>
2026-01-30 08:42 ` David Geier <[email protected]>
2026-02-02 20:33 ` Zsolt Parragi <[email protected]>
2026-02-03 11:03 ` David Geier <[email protected]>
2026-03-02 13:47 ` David Geier <[email protected]>
2026-03-26 08:50 ` David Geier <[email protected]>
2026-03-27 10:56 ` David Geier <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox