Re: Reduce build times of pg_trgm GIN indexes

From: Heikki Linnakangas <[email protected]>
To: David Geier <[email protected]>
To: pgsql-hackers <[email protected]>
Subject: Re: Reduce build times of pg_trgm GIN indexes
Date: Tue, 13 Jan 2026 00:10:03 +0200
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>

On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
> 
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
> 
> postgres=# select lower('III' COLLATE "tr_TR");
>   lower
> -------
>   ııı
> 
> postgres=# select show_trgm('III' COLLATE "tr_TR");
>          show_trgm
> -------------------------
>   {"  i"," ii","ii ",iii}
> (1 row)
> 
> But when using tr_TR as default locale of the database the following
> happens:
> 
> postgres=# select lower('III' COLLATE "tr_TR");
>   lower
> -------
>   ııı
> 
> postgres=# select show_trgm('III');
>                 show_trgm
> ---------------------------------------
>   {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
> 
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
> 
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column 
collation support, so I guess we never really thought through how it 
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one 
character at a time. In ICU locales, lower-casing a character can depend 
on the surrounding characters, so you cannot just test the conversion of 
every ASCII character individually. It would make sense for libc locales 
though, and I hope the ICU functions are a little faster anyway.
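
To make the libc idea concrete, here is a rough sketch of the kind of
check that could run once when a locale is created. It is not code from
the patch. The helper names are made up for illustration, and it uses
POSIX towlower_l() as a stand-in for whatever per-character function the
libc provider would really call. It also assumes that wchar_t values
0..127 correspond to ASCII, which holds on the usual platforms.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Reference behaviour: map 'A'..'Z' to 'a'..'z' and leave everything
 * else alone, i.e. what pg_ascii_tolower() does. (Hypothetical name.)
 */
static unsigned char
ascii_tolower_ref(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical helper: returns true if the given libc locale lower-cases
 * every ASCII character exactly like ascii_tolower_ref(). The result
 * could be cached in the locale struct and consulted by the fast path.
 */
static bool
locale_lowercases_ascii_like_c(locale_t loc)
{
    assert(loc != (locale_t) 0);

    for (int c = 0; c < 128; c++)
    {
        wint_t expected = (wint_t) ascii_tolower_ref((unsigned char) c);

        if (towlower_l((wint_t) c, loc) != expected)
            return false;
    }
    return true;
}
```

A caller would build the locale with newlocale(LC_ALL_MASK, name,
(locale_t) 0) and enable the ASCII fast path only if the helper returns
true. This works only because libc converts one character at a time; it
says nothing about ICU.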

Although, we probably should be using case-folding rather than 
lower-casing with ICU locales anyway. Case-folding is designed for 
string matching. It'd be a backwards-compatibility breaking change, though.

- Heikki
