Message-ID: <2e11134f-02c3-43da-8c39-fb520a1a251d@iki.fi>
Date: Tue, 13 Jan 2026 00:10:03 +0200
Subject: Re: Reduce build times of pg_trgm GIN indexes
To: David Geier, pgsql-hackers
From: Heikki Linnakangas

On
09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
>
> postgres=# select lower('III' COLLATE "tr_TR");
>  lower
> -------
>  ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
>         show_trgm
> -------------------------
>  {"  i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
>  lower
> -------
>  ııı
>
> postgres=# select show_trgm('III');
>                show_trgm
> ---------------------------------------
>  {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ASCII character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.

Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change,
though.

- Heikki
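
For illustration, the libc-only check discussed above could be sketched
roughly as follows. This is not code from the patch or from PostgreSQL:
the function names are hypothetical, it uses plain POSIX newlocale() and
tolower_l() instead of pg_locale_struct, and it assumes a POSIX.1-2008
system such as Linux with glibc. It only compares the byte-at-a-time
libc conversion of the 128 ASCII code points against the plain ASCII
rule, which is exactly the kind of per-character test that cannot be
applied to ICU locales.

```c
#define _POSIX_C_SOURCE 200809L

#include <ctype.h>
#include <locale.h>
#include <stdbool.h>

/*
 * Hypothetical helper: lower-case one byte using only the ASCII rule
 * (A-Z map to a-z, everything else is returned unchanged).
 */
static unsigned char
ascii_tolower_byte(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical check: returns true if the given libc locale lower-cases
 * every ASCII code point exactly like ascii_tolower_byte() does.
 */
static bool
locale_lowercases_ascii_plainly(locale_t loc)
{
    for (int ch = 0; ch < 128; ch++)
    {
        if (tolower_l(ch, loc) != ascii_tolower_byte((unsigned char) ch))
            return false;
    }
    return true;
}

/*
 * Convenience wrapper taking a locale name. Returns false if the locale
 * cannot be opened, so callers stay on the slow path when in doubt.
 */
static bool
locale_name_lowercases_ascii_plainly(const char *name)
{
    locale_t    loc = newlocale(LC_ALL_MASK, name, (locale_t) 0);
    bool        result;

    if (loc == (locale_t) 0)
        return false;

    result = locale_lowercases_ascii_plainly(loc);
    freelocale(loc);
    return result;
}
```

A caller would compute this flag once per locale and take the ASCII
fast path only when it is true. Whether a given Turkish locale passes
depends on the installed libc locale data, so no result is claimed here
for tr_TR.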