public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Use CASEFOLD() internally rather than LOWER()
4+ messages / 2 participants
[nested] [flat]

* Re: Use CASEFOLD() internally rather than LOWER()
@ 2026-02-28 13:27 Daniel Verite <[email protected]>
  2026-03-22 03:14 ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Daniel Verite @ 2026-02-28 13:27 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: pgsql-hackers

	Jeff Davis wrote:

> There are a number of internal callers of LOWER(), and conceptually
> those should all be using CASEFOLD(). Patches attached.

I tried 0001 with a non-UTF8 database and got quickly stuck:

create database latin9 encoding='LATIN9' 
 template=template0
 locale='fr_FR.ISO-8859-15@euro';

\c latin9

select 'abc' ilike 'bc';
ERROR:	Unicode case folding can only be performed if server encoding is UTF8

Presumably the internal calls to casefold should be conditionalized
on the encoding being UTF-8?


Best regards,
-- 
Daniel Vérité 
https://postgresql.verite.pro/






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Use CASEFOLD() internally rather than LOWER()
  2026-02-28 13:27 Re: Use CASEFOLD() internally rather than LOWER() Daniel Verite <[email protected]>
@ 2026-03-22 03:14 ` Mark Dilger <[email protected]>
  2026-03-25 14:40   ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Mark Dilger @ 2026-03-22 03:14 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers

On Tue, Mar 3, 2026 at 1:01 PM Jeff Davis <[email protected]> wrote:

> On Sat, 2026-02-28 at 14:27 +0100, Daniel Verite wrote:
> > I tried 0001 with a non-UTF8 database and got quickly stuck:
>
> Attached new versions. I moved the encoding check into the SQL-callable
> casefold() function, and other callers use str_casefold(). That
> slightly simplifies what happens in ILIKE, also.
>
> I removed the citext changes. citext has somewhat of a legacy status, I
> think, so I'm not sure it makes sense to try to modernize or change it.
> Also, some SQL-language functions in citext use LOWER(), so the changes
> aren't enough: we'd need to make the SQL CASEFOLD function callable in
> other encodings, and also run a citext upgrade script to change the
> definitions.
>
> Note that these changes affect the result of some expressions (e.g.
> ILIKE), so could theoretically make an expression index or predicate
> index inconsistent.
>

Thanks for the patches!

After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm still
uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
With builtin collations, these produce different results.


Attachments:

  [application/octet-stream] WIP-v3-0001-Demonstrate-inconsistency-in-gin-index-vs-seq-sca.patch-WIP (12.1K, 3-WIP-v3-0001-Demonstrate-inconsistency-in-gin-index-vs-seq-sca.patch-WIP)
  download

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Use CASEFOLD() internally rather than LOWER()
  2026-02-28 13:27 Re: Use CASEFOLD() internally rather than LOWER() Daniel Verite <[email protected]>
  2026-03-22 03:14 ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
@ 2026-03-25 14:40   ` Mark Dilger <[email protected]>
  2026-03-26 00:01     ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Mark Dilger @ 2026-03-25 14:40 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers

On Tue, Mar 24, 2026 at 4:07 PM Jeff Davis <[email protected]> wrote:

> On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote:
> > After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm
> > still
> > uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
> > With builtin collations, these produce different results.
>
> Interesting, thank you. As stated in the original message, I was unsure
> about changing pg_trgm without adjusting the regex logic, also:
>
>
> https://www.postgresql.org/message-id/[email protected]
>
> do you have a suggestion about an easy way to do that, or should we
> revisit in the next cycle?
>

pg_trgm appears to be lossy, with recheck logic.  I would think you just
need to make it give answers which at least include everything that a regex
would match, and then allow recheck to prune that down.  My concern is
having pg_trgm give less than all the answers, so that after recheck you
get fewer results than a seqscan would have returned.  Would switching to
casefold be strictly broader than regex?  If so, you would just need to
convert pg_trgm to use casefold and then rely on the recheck machinery.

Sorry if this misses something discussed upthread.  I'm clearly assuming
here that you don't mind that such a change necessitates a REINDEX.

-- 

*Mark Dilger*


^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Use CASEFOLD() internally rather than LOWER()
  2026-02-28 13:27 Re: Use CASEFOLD() internally rather than LOWER() Daniel Verite <[email protected]>
  2026-03-22 03:14 ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
  2026-03-25 14:40   ` Re: Use CASEFOLD() internally rather than LOWER() Mark Dilger <[email protected]>
@ 2026-03-26 00:01     ` Mark Dilger <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Mark Dilger @ 2026-03-26 00:01 UTC (permalink / raw)
  To: Jeff Davis <[email protected]>; +Cc: Daniel Verite <[email protected]>; pgsql-hackers

On Wed, Mar 25, 2026 at 2:02 PM Jeff Davis <[email protected]> wrote:

> I think the precise question would be: "are there any two characters
> that lowercase to the same character but do not casefold to the same
> character?".
>

I don't know.  I'll set up a test to iterate across all locales across all
character pairs... no, I didn't find any on my system.  Other searching
suggests that the Turkish and Azerbaijani locale do have this
characteristic, with I (U+0049) lowercasing to ı (U+0131) and case folding
to i (U+0069) while ı (U+0131) lowercases to ı (U+0131) but also case folds
to ı (U+0131).  I have not confirmed that empirically, though.


> I don't have a counterexample, so perhaps using casefold would still be
> fine.
>
> Thoughts? Should we enhance regexes to consider more than two case
> variants first, or should we proceed with some of these patches (and/or
> a similar change to pg_trgm)?
>

I don't want to take a strong position either way.  I'm still wrapping my
head around the various implications of the proposed changes, and don't
feel I have a complete picture yet.

-- 

*Mark Dilger*


^ permalink  raw  reply  [nested|flat] 4+ messages in thread


end of thread, other threads:[~2026-03-26 00:01 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-28 13:27 Re: Use CASEFOLD() internally rather than LOWER() Daniel Verite <[email protected]>
2026-03-22 03:14 ` Mark Dilger <[email protected]>
2026-03-25 14:40   ` Mark Dilger <[email protected]>
2026-03-26 00:01     ` Mark Dilger <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox