From: Tom Lane <tgl@sss.pgh.pa.us>
To: Rahman Duran <a.rahman.duran@gmail.com>
cc: pgsql-general@lists.postgresql.org
Subject: Re: PostgreSQL 18.1 non deterministic collation "LIKE %abc%" performance
In-reply-to: <CA+jeBrejfchM60epUo1J3M8mMAQOcCWotpDP9QZYgEL=j0b=pg@mail.gmail.com>
References: <CA+jeBrejfchM60epUo1J3M8mMAQOcCWotpDP9QZYgEL=j0b=pg@mail.gmail.com>
Comments: In-reply-to Rahman Duran <a.rahman.duran@gmail.com>
	message dated "Tue, 30 Dec 2025 09:30:27 +0300"
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <3287813.1767077500.1@sss.pgh.pa.us>
Date: Tue, 30 Dec 2025 01:51:40 -0500
Message-ID: <3287814.1767077500@sss.pgh.pa.us>
Archived-At: <https://www.postgresql.org/message-id/3287814.1767077500%40sss.pgh.pa.us>
Precedence: bulk

Rahman Duran <a.rahman.duran@gmail.com> writes:
> After the release of the PostgreSQL 18 version, I am trying non
> determisinstic collation with LIKE pattern matching support. I am mostly
> searching with "LIKE %search_term%" on about 10 text columns. As I use
> wildcard prefix and suffix, I can't use btree index anyways. So I decided
> to try non deterministic collation support so I can simplify application
> code. I am testing this on a table with ~60K rows. With this row count and
> search pattern, non deterministic collation seems at least 10 times slower
> than LOWER LIKE and ILIKE.

This is not terribly surprising: non-deterministic collations disable
a lot of lower-level optimizations in pattern matching.  I think the
particular one that is biting you is probably this bit in
src/backend/utils/adt/like_match.c:

             * ... With a nondeterministic collation, we can't
             * rely on the first bytes being equal, so we have to recurse in
             * any case.

or possibly the later bit

             * For nondeterministic locales, we find the next substring of the
             * pattern that does not contain wildcards and try to find a
             * matching substring in the text.  Crucially, we cannot do this
             * character by character, as in the normal case, but must do it
             * substring by substring, partitioned by the wildcard characters.
             * (This is per SQL standard.)

The fundamental problem here is not wanting to make assumptions about
which character strings a non-deterministic collation will consider
equal to which other character strings.  If you have concrete ideas
about how to improve that, let's hear them.

			regards, tom lane