Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2oel-006Mvp-2X for pgsql-hackers@arkaria.postgresql.org; Fri, 03 May 2024 08:52:39 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s2oeh-007TG8-Vk for pgsql-hackers@arkaria.postgresql.org; Fri, 03 May 2024 08:52:36 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2oeh-007TG0-Ls for pgsql-hackers@lists.postgresql.org; Fri, 03 May 2024 08:52:36 +0000 Received: from fout4-smtp.messagingengine.com ([103.168.172.147]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2oeg-001JWM-9G for pgsql-hackers@postgresql.org; Fri, 03 May 2024 08:52:35 +0000 Received: from compute7.internal (compute7.nyi.internal [10.202.2.48]) by mailfout.nyi.internal (Postfix) with ESMTP id 536121380253; Fri, 3 May 2024 04:52:33 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute7.internal (MEProxy); Fri, 03 May 2024 04:52:33 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=eisentraut.org; h=cc:cc:content-transfer-encoding:content-type:content-type :date:date:from:from:in-reply-to:in-reply-to:message-id :mime-version:references:reply-to:subject:subject:to:to; s=fm2; t=1714726353; x=1714812753; bh=3cyHABo+at8JZS5IQ625x+/PMCE7gnOl 1PTVDdg6X00=; b=SGiPizbtlxzHkT/L06fD0l+fac7OluS22ZUqBPFFV08WfMFy DBaoTDNx1oHKMZ1e3MKzyUL6F2dSq9Oyah5q0ere4tld5chiCNA6h8VVSbYlhu2m vbrFei0eESl+KarfQCVvwhb1lXr5zB8aUs10u4G0VKs3o69k6mHL5h0EmpM5MOO9 vFGZ00BovadREG8jXnEb1SCDF//q4OusWkOPdQ4igEQUsktK8SYXoi421Xqcpgir 98l8YG+XXxZ+mwqc1ijZ22s3b4MexRyUsvuByT0WJOHwi/Ukj/sEYw4QPTAmdBQl iScfp1VPca2nuLhQouEbd+FEIm7CGtNHoNUbAQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t=1714726353; x= 1714812753; bh=3cyHABo+at8JZS5IQ625x+/PMCE7gnOl1PTVDdg6X00=; b=d wQ0BUNyxqOVR11OLZaUGft8I0/iKJW2UoNpmr3h7srt23/CIIPwwLwJeE7pDuWbB Y/AV1nxcquBDMILMCosu8bI715frBKwjZ5+PlQ7ccVdAOikhz9xpeTl+MdJT91wK qIz+WdbssXqZTGGJ2PjFLiD0nzXzvr7O8+3itJWXycfmq2qB6fQj1Ti/fGSJKtic 2UVfyrAoxG1bT7ulsjdQ54LL/muh9OwJsI5pkEwmq/IoiPRlS+cfRlczPRLlAsbj ZRHiS2xyOCeIyQmuUSnPcUIyXrVK8JcQvIdea9AygzJzxd6N//zDvNSKB/Wk01FC DetmvTGMDMK5zAxHXgREw== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvledrvddvtddgtdejucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhepkfffgggfuffvvehfhfgjtgfgsehtkeertddtvdejnecuhfhrohhmpefrvght vghrucfgihhsvghnthhrrghuthcuoehpvghtvghrsegvihhsvghnthhrrghuthdrohhrgh eqnecuggftrfgrthhtvghrnhepjefhveehtdetgfffhffhfeefgffghffflefgieeuueek hedvhedvfeehffdvfeeunecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrg hilhhfrhhomhepphgvthgvrhesvghishgvnhhtrhgruhhtrdhorhhg X-ME-Proxy: Feedback-ID: ie0a040ee:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Fri, 3 May 2024 04:52:32 -0400 (EDT) Message-ID: <68263a89-b6af-4705-ac08-9a57cdd63bd0@eisentraut.org> Date: Fri, 3 May 2024 10:52:31 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Support LIKE with nondeterministic collations To: Robert Haas Cc: Daniel Verite , pgsql-hackers References: Content-Language: en-US From: Peter Eisentraut In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 03.05.24 02:11, Robert Haas wrote: > On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut wrote: >> On 30.04.24 14:39, Daniel Verite wrote: >>> postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct; >>> ?column? >>> ---------- >>> f >>> (1 row) >>> >>> The first two results look fine, but the next one is inconsistent. >> >> This is correct, because '_' means "any single character". This is >> independent of the collation. > > Seems really counterintuitive. I had to think for a long time to be > able to guess what was happening here. Finally I came up with this > guess: > > If the collation-aware matching tries to match up f with the initial > period, the period is skipped and the f matches f. But when the > wildcard is matched to the initial period, that uses up the wildcard > and then we're left trying to match o with f, which doesn't work. Formally, what X like '_oo' means is, can X be partitioned into substrings such that the first substring is a single character and the second substring is equal to 'oo' under the applicable collation? This is false in this case, there is no such partitioning. What the implementation does is, it walks through the pattern. It sees '_', so it steps over one character in the input string, which is '.' here. Then we have 'foo.' left to match in the input string. Then it takes from the pattern the next substring up to but not including either a wildcard character or the end of the string, which is 'oo', and then it checks if a prefix of the remaining input string can be found that is "equal to" 'oo'. So here it would try in turn '' = 'oo' collate ign_punct ? 'f' = 'oo' collate ign_punct ? 'fo' = 'oo' collate ign_punct ? 'foo' = 'oo' collate ign_punct ? 'foo.' = 'oo' collate ign_punct ? and they all fail, so the match fails. > It'd probably be good to use something like this as an example in the > documentation. My intuition is that if foo matches a string, then _oo > f_o and fo_ should also match that string. Apparently that's not the > case, but I doubt I'll be the last one who thinks it should be. This intuition fails because with nondeterministic collations, strings of different lengths can be equal, and so the question arises, what does the pattern '_' mean. It could mean either, (1) a single character, or perhaps something like, (2) a string that is equal to some other string of length one. The second definition would satisfy the expectation here, because then '.f' matches '_' because '.f' is equal to some string of length one, such as 'f'. (And then 'oo.' matches 'oo' for the rest of the pattern.) However, off the top of my head, this definition has three flaws: (1) It would make the single-character wildcard effectively an any-number-of-characters wildcard, but only in some circumstances, which could be confusing, (2) it would be difficult to compute, because you'd have to check equality against all possible single-character strings, and (3) it is not what the SQL standard says. In any case, yes, some explanation and examples should be added.