Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5gQH-003W03-1a for pgsql-hackers@arkaria.postgresql.org; Thu, 26 Mar 2026 08:50:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w5gQE-001jOV-31 for pgsql-hackers@arkaria.postgresql.org; Thu, 26 Mar 2026 08:50:35 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5gQE-001jOL-1v for pgsql-hackers@lists.postgresql.org; Thu, 26 Mar 2026 08:50:35 +0000 Received: from mail-wm1-x331.google.com ([2a00:1450:4864:20::331]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w5gQD-0000000173y-15Dx for pgsql-hackers@postgresql.org; Thu, 26 Mar 2026 08:50:34 +0000 Received: by mail-wm1-x331.google.com with SMTP id 5b1f17b1804b1-486b9675d36so6319835e9.0 for ; Thu, 26 Mar 2026 01:50:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1774515032; x=1775119832; darn=postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id:from :to:cc:subject:date:message-id:reply-to; bh=t7JOwx+SzYq38Aee+DIE8W8/jzlMmh9pVk85llWzhSQ=; b=ZEoGf+71/UP70zGYMambNB3mvAtJ55B0+zH5apAqz3nsd9ugVOesrP69INnhvfoxY9 x2Z9F+eaLHh/a4yZTIV03c7w6NtKVGK33MvPBfmT0BZO/UOgQhtrR0QOX/9lDAI/0Exr bEYRkcum08aUj0nWX23PCab8GhhSVFzOVG+uzDZVSYqoZCPGG6oTzoYoqqAPDgG0eoBV T68cslll5gYzyffRisBgxZdG5lG6c3vxRQ2au/H4OeL+hQqzE9VbIZwr3/ZlL60AzF9y F1+fG8WUfTp47cHxL55/F9wSPVsz6oYKlos7ep8WiOtgKN/sfVaStosne+oVdiLrZiOQ dtiw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774515032; x=1775119832; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=t7JOwx+SzYq38Aee+DIE8W8/jzlMmh9pVk85llWzhSQ=; b=a3myLmAMFWNXMHvpYNo2K7wziY4lXlYdx6fBlnGaLAkeSrDq1h+XhIHD7CMB+Doo0s aXeke6Q2leeODln3OYsWaI3mc1hnbmjZGde/Q+4MZUW2smgDnLyQxHY3pXEuZlyXHtcT /nMPDjiRdPXePRXEBhdvEhZXUGbUe8muZepJEA/3bcxWDNVaDeVtMj9w0Cl59B3fFU+B 6W+xF10QyYMwP328uQ/6xqIjWiVCA5HN1gMBCU965lF+v72URD5yjpaWrfTaGmcUN6sI LE/oPDsGuLV1p9LFgXF0zX9yrnJbbpWxVwEpcqE50VC9HvgdVxKLxNEkanMlb7hrlMAW h29w== X-Forwarded-Encrypted: i=1; AJvYcCVzW5pERalf5yoSgwd1uBj8r4iwHhFxyhjLl5BwY3KxZvqX24hERZqn+oSdyoujnLtb3tUO0vgneNbOzbX6@postgresql.org X-Gm-Message-State: AOJu0YxLpRVPkCUmLxeTXi8U62kU99ZSBWtONCaUZldcz9GW8aCrJ/hK Ax9PwcN5y8YpP6FMLffMuW0GaY9NMdMtBMCzST6me5HpGr4VsUiF4h3r X-Gm-Gg: ATEYQzxCHsfnD16RwoEksVW68BEOGJXYUJXg0+MxTe4XH8EQtZJtquYCIJEhrlWayWc whvaX7w6ssXmOrCa0lW1CYVFubpgYHBeNM5lXvIDg1sLvQzLXXv+jSSskBvYhTKXzkv7G3DKp0/ i+RE+00CV7KzDVXIl8uSyB7G/nI8LlDSoNIO8NP9YA9Tu9pF4ywL71yC8X9PmynKYCIRywW9hjG NGgBGrFQfdWLxyTBVaZBL3AKX89QLNt8SKE0ofuuB0MnxIJp/+lWLAfKufmO7inWwzdn+Tdv3CL nF9UOhEGsv1f4qV9AG/F1eybFYRDs/ZZ9/j5DIj6PAcsGsSLe6Yg98pQVwYV+6DmLqt58v3eFrT NlM7tkrZw6C0pV3qXp6pYAeXxm1XW2pLXvULHBdQ/VSstDQfovK33CWLY9ehM9NBv6bdgnPmq2J BF3fVf4aGmLFRtINKZJkQnUs6c X-Received: by 2002:a05:600c:3b12:b0:485:419c:4eba with SMTP id 5b1f17b1804b1-48715fc3286mr98452465e9.1.1774515031625; Thu, 26 Mar 2026 01:50:31 -0700 (PDT) Received: from [172.31.5.233] ([147.161.235.1]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-48722d236a9sm36673545e9.11.2026.03.26.01.50.29 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 26 Mar 2026 01:50:31 -0700 (PDT) Message-ID: <7e11acde-9d5b-49a1-9c41-23096d51d2e2@gmail.com> Date: Thu, 26 Mar 2026 09:50:28 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Use correct collation in pg_trgm To: Jeff Davis , pgsql-hackers , Heikki Linnakangas References: <2c15502fd399128ee27fbe1a305e006780159f66.camel@j-davis.com> Content-Language: en-US From: David Geier In-Reply-To: <2c15502fd399128ee27fbe1a305e006780159f66.camel@j-davis.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Jeff! > This area is a bit awkward conceptually. The case you found is not > about the *sort order* of the values; it's about the casing semantics. > We mix those two concepts into a single "collation oid" that determines > both sort order and casing semantics (and pattern matching semantics, > too). > > LOWER() and UPPER() take the casing semantics from the inferred > collation, so that's a good argument that you're doing the right thing > here. > > But full text search does not; it uses DEFAULT_COLLATION_OID for > parsing the input. That sort of makes sense, because tsvector/tsquery > don't have a collatable sort order -- it's more about the parsing > semantics to create the values in the first place, not about how the > tsvector/tsquery values are sorted. For pg_trgm it's also not only about casing but also about parsing: the decision of what is considered an alpha-numeric character in ISWORDCHR() depends on the collation. > So that leaves me wondering: why would pg_trgm use the inferred > collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem > conceptually similar, and the only real difference I see is that > tsvector/tsquery are types and pg_trgm is a set of functions. I agree. That is inconsistent. But if anything, shouldn't we change tsvector/tsquery to as well adhere to the inferred collation? For example, when a user specifies a collation for some table column, he expects the collation to not only impact sort order. With say collation en-US-x-icu, the B-tree lookup will be case-insensitive. Why would a GIN index suddenly not adhere to the collation? That seems counter-intuitive and confusing. The same when using tsvector/tsquery. More generally: shouldn't it, from a user's point-of-view, be an all or nothing to avoid surprises? If not, we should come up with easy to understand and easy to remember reasons for what adheres to the inferred collation and what adheres the default collation and document that. > Note that I made some changes here recently: full text search and ltree > used to use libc unconditionally or a mix of libc and > DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to > consistently use DEFAULT_COLLATION_OID. But I didn't resolve the > conceptual problem of whether we should use the inferred collation (as > you suggest) or not. Thanks for the heads up. -- David Geier