public inbox for [email protected]
17+ messages / 7 participants
* Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2024-12-20 05:20 Andreas Karlsson <[email protected]>
0 siblings, 2 replies; 17+ messages in thread
From: Andreas Karlsson @ 2024-12-20 05:20 UTC (permalink / raw)
To: pgsql-hackers; +Cc: Jeff Davis <[email protected]>
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8 specific versions which means we can call those directly if the
database encoding is UTF-8 and skip having to convert to and from UChar.
Since most people today run their databases in UTF-8 I think this
optimization is worth it and when measuring on short to medium length
strings I got a 15-20% speed up. It is still slower than glibc in my
benchmarks but the gap is smaller now.
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master: ~540 ms
Patched: ~460 ms
glibc: ~410 ms
I have also attached a clean up patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths but it turned
out to be a bit messy due to different function signatures for
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
Andreas
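For readers following the UTF-8 branches in the attached patch: as far as I can tell, they return the length ICU reports as needed and let U_BUFFER_OVERFLOW_ERROR pass, which leaves any retry with a larger buffer to the caller. Below is a minimal sketch of that contract in plain C. The names upper_sketch() and upper_alloc_sketch() are hypothetical, and the ASCII-only converter is only a stand-in for ucasemap_utf8ToUpper(), not the real function.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for ucasemap_utf8ToUpper(): ASCII only.
 * Writes at most destsize bytes and returns the length the full result
 * needs (excluding the terminating NUL), even when it did not fit.
 */
size_t
upper_sketch(char *dest, size_t destsize, const char *src, size_t srclen)
{
    size_t needed = srclen;     /* ASCII upper-casing keeps the length */
    size_t i;

    for (i = 0; i < srclen && i < destsize; i++)
        dest[i] = (char) toupper((unsigned char) src[i]);
    if (needed < destsize)
        dest[needed] = '\0';    /* terminate only when there is room */
    return needed;
}

/*
 * Caller side: try a small buffer first, then retry once with the
 * size the converter reported. Returns a malloc'ed string or NULL.
 */
char *
upper_alloc_sketch(const char *src)
{
    size_t srclen = strlen(src);
    size_t destsize = 8;        /* deliberately small first guess */
    char  *dest = malloc(destsize);
    size_t needed;

    if (dest == NULL)
        return NULL;

    needed = upper_sketch(dest, destsize, src, srclen);
    if (needed + 1 > destsize)
    {
        char *bigger = realloc(dest, needed + 1);

        if (bigger == NULL)
        {
            free(dest);
            return NULL;
        }
        dest = bigger;
        destsize = needed + 1;
        (void) upper_sketch(dest, destsize, src, srclen);
    }
    return dest;
}
```

Calling upper_alloc_sketch("kalhuvud 1000000") takes the retry path, because the 16-byte result does not fit the initial 8-byte buffer.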
Attachments:
[text/x-patch] v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (6.7K, 2-v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch)
From 5a355ef083cc7de92ae1e5dcc0198866a07919eb Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 17 Dec 2024 22:47:00 +0100
Subject: [PATCH v1 1/2] Use optimized versions of ICU case conversion for
UTF-8
Instead of converting to and from UChar when doing case conversions we
use the UTF-8 versions of the functions. This can give a significant
speedup, 15-20%, on short to medium length strings.
---
src/backend/utils/adt/pg_locale_icu.c | 161 ++++++++++++++++++--------
1 file changed, 114 insertions(+), 47 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index f0a77a767e7..eea6f48f6c3 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -12,6 +12,7 @@
#include "postgres.h"
#ifdef USE_ICU
+#include "unicode/ucasemap.h"
#include <unicode/ucnv.h>
#include <unicode/ustring.h>
@@ -100,9 +101,9 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
const UChar *buff_uchar, int32_t len_uchar);
static void icu_set_collation_attributes(UCollator *collator, const char *loc,
UErrorCode *status);
-static int32_t icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source,
- int32_t len_source);
+static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source,
+ int32_t len_source);
static int32_t u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
@@ -350,60 +351,126 @@ size_t
strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToLower, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ int32_t needed;
+
+ casemap = ucasemap_open(locale->info.icu.locale, U_FOLD_CASE_DEFAULT, &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+ status = U_ZERO_ERROR;
+ needed = ucasemap_utf8ToLower(casemap, dest, destsize, src, srclen, &status);
+ ucasemap_close(casemap);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ {
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(u_strToLower, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+ }
}
size_t
strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToTitle_default_BI, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ int32_t needed;
+
+ casemap = ucasemap_open(locale->info.icu.locale, U_FOLD_CASE_DEFAULT, &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+ status = U_ZERO_ERROR;
+ needed = ucasemap_utf8ToTitle(casemap, dest, destsize, src, srclen, &status);
+ ucasemap_close(casemap);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ {
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(u_strToTitle_default_BI, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+ }
}
size_t
strupper_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToUpper, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ int32_t needed;
+
+ casemap = ucasemap_open(locale->info.icu.locale, U_FOLD_CASE_DEFAULT, &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+ status = U_ZERO_ERROR;
+ needed = ucasemap_utf8ToUpper(casemap, dest, destsize, src, srclen, &status);
+ ucasemap_close(casemap);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ {
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(u_strToUpper, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+ }
}
/*
@@ -599,8 +666,8 @@ icu_from_uchar(char *dest, size_t destsize, const UChar *buff_uchar, int32_t len
}
static int32_t
-icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source, int32_t len_source)
+icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source, int32_t len_source)
{
UErrorCode status;
int32_t len_dest;
--
2.45.2
[text/x-patch] v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patch (3.9K, 3-v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patch)
From a4bfcbd8d9ad9c56995fa4a6736480fc11ce1bd4 Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Fri, 20 Dec 2024 02:00:33 +0100
Subject: [PATCH v1 2/2] Reduce code duplication in ICU case mapping code
---
src/backend/utils/adt/pg_locale_icu.c | 74 ++++++++++-----------------
1 file changed, 26 insertions(+), 48 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index eea6f48f6c3..905b2308fbd 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -101,6 +101,9 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
const UChar *buff_uchar, int32_t len_uchar);
static void icu_set_collation_attributes(UCollator *collator, const char *loc,
UErrorCode *status);
+static int32_t icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest,
+ size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
UChar **buff_dest, UChar *buff_source,
int32_t len_source);
@@ -371,22 +374,7 @@ strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return needed;
}
else
- {
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case_uchar(u_strToLower, locale, &buff_conv,
- buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
- }
+ return icu_convert_case_no_utf8(u_strToLower, dest, destsize, src, srclen, locale);
}
size_t
@@ -413,22 +401,7 @@ strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return needed;
}
else
- {
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case_uchar(u_strToTitle_default_BI, locale, &buff_conv,
- buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
- }
+ return icu_convert_case_no_utf8(u_strToTitle_default_BI, dest, destsize, src, srclen, locale);
}
size_t
@@ -455,22 +428,7 @@ strupper_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return needed;
}
else
- {
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case_uchar(u_strToUpper, locale, &buff_conv,
- buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
- }
+ return icu_convert_case_no_utf8(u_strToUpper, dest, destsize, src, srclen, locale);
}
/*
@@ -665,6 +623,26 @@ icu_from_uchar(char *dest, size_t destsize, const UChar *buff_uchar, int32_t len
return len_result;
}
+static int32_t
+icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest, size_t destsize,
+ const char *src, ssize_t srclen, pg_locale_t locale)
+{
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(func, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+}
+
static int32_t
icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
UChar **buff_dest, UChar *buff_source, int32_t len_source)
--
2.45.2
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2024-12-20 19:24 Jeff Davis <[email protected]>
parent: Andreas Karlsson <[email protected]>
1 sibling, 1 reply; 17+ messages in thread
From: Jeff Davis @ 2024-12-20 19:24 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; pgsql-hackers
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>
> master: ~540 ms
> Patched: ~460 ms
> glibc: ~410 ms
It looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.
Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?
Regards,
Jeff Davis
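To illustrate the "und" fix-up mentioned above: pg_ucol_open() rewrites a leading "und" language to "root" before opening the locale on old ICU versions. A rough sketch of that string rewrite in plain C follows. fix_locale_str_sketch() is a hypothetical name, and the sketch finds the language subtag with a simple case-sensitive prefix check, whereas the real code asks ICU via uloc_getLanguage(). Treat it as an approximation of the idea, not the actual logic.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: returns a malloc'ed copy of loc_str with a leading "und"
 * language replaced by "root", or NULL when no rewrite is needed (or when
 * allocation fails).
 *
 * Assumption: the language subtag ends at '-', '_', '@' or end of string.
 */
char *
fix_locale_str_sketch(const char *loc_str)
{
    const size_t und_len = strlen("und");
    const char  *remainder;
    char        *fixed;

    if (strncmp(loc_str, "und", und_len) != 0)
        return NULL;

    remainder = loc_str + und_len;
    if (*remainder != '\0' && *remainder != '-' &&
        *remainder != '_' && *remainder != '@')
        return NULL;            /* e.g. "undx" is some other tag */

    fixed = malloc(strlen("root") + strlen(remainder) + 1);
    if (fixed == NULL)
        return NULL;
    strcpy(fixed, "root");
    strcat(fixed, remainder);
    return fixed;
}
```

With this sketch, "und-u-co-emoji" becomes "root-u-co-emoji", while "sv-SE" is left alone (NULL is returned). Sharing one such helper between the collator and casemap open paths is one way to read the "factor out that logic" suggestion.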
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-03-17 06:46 vignesh C <[email protected]>
parent: Andreas Karlsson <[email protected]>
1 sibling, 1 reply; 17+ messages in thread
From: vignesh C @ 2025-03-17 06:46 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; +Cc: pgsql-hackers; Jeff Davis <[email protected]>
I noticed that Jeff's comments from [1] have not yet been addressed, so
I have changed the commitfest entry status to "Waiting on Author".
Please address them and then update it to "Needs Review".
[1] - https://www.postgresql.org/message-id/[email protected]
Regards,
Vignesh
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-03-29 18:50 Andres Freund <[email protected]>
parent: vignesh C <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andres Freund @ 2025-03-29 18:50 UTC (permalink / raw)
To: vignesh C <[email protected]>; +Cc: Andreas Karlsson <[email protected]>; pgsql-hackers; Jeff Davis <[email protected]>
On 2025-03-17 12:16:11 +0530, vignesh C wrote:
> I noticed that Jeff's comments from [1] have not yet been addressed, so
> I have changed the commitfest entry status to "Waiting on Author".
> Please address them and then update it to "Needs Review".
> [1] - https://www.postgresql.org/message-id/[email protected]
It's also worth noting that this patch hasn't been building for quite a while
(at least not since 2025-01-29):
https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
[17:17:51.214] ld: error: undefined symbol: icu_convert_case
[17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
[17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
[17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
I think we can mark this as returned-with-feedback for now?
Greetings,
Andres Freund
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-03-30 01:18 vignesh C <[email protected]>
parent: Andres Freund <[email protected]>
0 siblings, 0 replies; 17+ messages in thread
From: vignesh C @ 2025-03-30 01:18 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Andreas Karlsson <[email protected]>; pgsql-hackers; Jeff Davis <[email protected]>
On Sun, 30 Mar 2025 at 00:20, Andres Freund <[email protected]> wrote:
> It's also worth noting that this patch hasn't been building for quite a while
> (at least not since 2025-01-29):
>
> https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
> [17:17:51.214] ld: error: undefined symbol: icu_convert_case
> [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
> [17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
> [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
>
> I think we can mark this as returned-with-feedback for now?
Thanks, the commitfest entry is marked as returned with feedback.
@Andreas Karlsson Feel free to add a new commitfest entry when you
have addressed the feedback.
Regards,
Vignesh
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-12-31 00:18 Andreas Karlsson <[email protected]>
parent: Jeff Davis <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Karlsson @ 2025-12-31 00:18 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; pgsql-hackers
On 12/20/24 8:24 PM, Jeff Davis wrote:
> On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
>> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
>> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>>
>> master: ~540 ms
>> Patched: ~460 ms
>> glibc: ~410 ms
>
> It looks like you are opening and closing the UCaseMap object each
> time. Why not save it in pg_locale_t? That should speed it up even more
> and hopefully beat libc.
Fixed. New benchmarks are:
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master: ~570 ms
Patched: ~340 ms
glibc: ~400 ms
So it does indeed seem like we got a further speedup and now are faster
than glibc.
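As a quick check on the timings quoted in this thread, the relative reduction in run time can be computed as (before - after) / before. The helper below is only an illustration; runtime_reduction_pct() is a made-up name, and the inputs are the approximate millisecond figures from the two benchmark runs.

```c
/*
 * Percentage by which run time drops when going from before_ms to
 * after_ms. Illustrative helper only; before_ms must be non-zero.
 */
double
runtime_reduction_pct(double before_ms, double after_ms)
{
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

Plugging in the numbers: the first patch version (540 ms to 460 ms) comes out at a bit under 15%, this version (570 ms to 340 ms) at roughly 40%, and the patched build against glibc (400 ms to 340 ms) at about 15%.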
> Also, to support older ICU versions consistently, we need to fix up the
> locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
> that logic?
Fixed.
Andreas
Attachments:
[text/x-patch] v2-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (13.6K, 2-v2-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch)
From 138ecc65c85aeec7a1c0459f69642fd1ea3103db Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 17 Dec 2024 22:47:00 +0100
Subject: [PATCH v2] Use optimized versions of ICU case conversion for UTF-8
Instead of converting to and from UChar when doing case conversions we
use the UTF-8 versions of the functions. This can give a significant
speedup, 30-40%, on short to medium length strings.
The only cost we incur is that we have to allocate a casemap object on
locale initialization for UTF-8 databases but the object is relatively
small and the assumption is that most users will at some point want to
run case conversion functions.
While at it we also remove some duplication in the non-UTF-8 code.
---
src/backend/utils/adt/pg_locale_icu.c | 253 +++++++++++++++++---------
src/include/utils/pg_locale.h | 2 +
2 files changed, 164 insertions(+), 91 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 43d44fe43bd..02d9efd0d64 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -52,6 +52,7 @@ extern pg_locale_t create_pg_locale_icu(Oid collid, MemoryContext context);
#ifdef USE_ICU
extern UCollator *pg_ucol_open(const char *loc_str);
+static UCaseMap *pg_ucasemap_open(const char *loc_str);
static size_t strlower_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
@@ -111,9 +112,12 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
const UChar *buff_uchar, int32_t len_uchar);
static void icu_set_collation_attributes(UCollator *collator, const char *loc,
UErrorCode *status);
-static int32_t icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source,
- int32_t len_source);
+static int32_t icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest,
+ size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source,
+ int32_t len_source);
static int32_t u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
@@ -140,6 +144,8 @@ tolower_icu(pg_wchar wc, pg_locale_t locale)
return u_tolower(wc);
}
+static int32_t icu_foldcase_options(const char *locale);
+
static const struct collate_methods collate_methods_icu = {
.strncoll = strncoll_icu,
.strnxfrm = strnxfrm_icu,
@@ -278,6 +284,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
const char *icurules = NULL;
UCollator *collator;
locale_t loc = (locale_t) 0;
+ UCaseMap *casemap = NULL;
pg_locale_t result;
if (collid == DEFAULT_COLLATION_OID)
@@ -339,10 +346,14 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
collator = make_icu_collator(iculocstr, icurules);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ casemap = pg_ucasemap_open(iculocstr);
+
result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
result->icu.locale = MemoryContextStrdup(context, iculocstr);
result->icu.ucol = collator;
result->icu.lt = loc;
+ result->icu.ucasemap = casemap;
result->deterministic = deterministic;
result->collate_is_c = false;
result->ctype_is_c = false;
@@ -366,41 +377,18 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
#ifdef USE_ICU
/*
- * Wrapper around ucol_open() to handle API differences for older ICU
- * versions.
- *
- * Ensure that no path leaks a UCollator.
+ * In ICU versions 54 and earlier, "und" is not a recognized spelling of the
+ * root locale. If the first component of the locale is "und", replace with
+ * "root" before opening.
*/
-UCollator *
-pg_ucol_open(const char *loc_str)
+static char *
+fix_icu_locale_str(const char *loc_str)
{
- UCollator *collator;
- UErrorCode status;
- const char *orig_str = loc_str;
- char *fixed_str = NULL;
-
- /*
- * Must never open default collator, because it depends on the environment
- * and may change at any time. Should not happen, but check here to catch
- * bugs that might be hard to catch otherwise.
- *
- * NB: the default collator is not the same as the collator for the root
- * locale. The root locale may be specified as the empty string, "und", or
- * "root". The default collator is opened by passing NULL to ucol_open().
- */
- if (loc_str == NULL)
- elog(ERROR, "opening default collator is not supported");
-
- /*
- * In ICU versions 54 and earlier, "und" is not a recognized spelling of
- * the root locale. If the first component of the locale is "und", replace
- * with "root" before opening.
- */
if (U_ICU_VERSION_MAJOR_NUM < 55)
{
char lang[ULOC_LANG_CAPACITY];
+ UErrorCode status = U_ZERO_ERROR;
- status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
@@ -413,15 +401,49 @@ pg_ucol_open(const char *loc_str)
if (strcmp(lang, "und") == 0)
{
const char *remainder = loc_str + strlen("und");
+ char *fixed_str;
fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
strcpy(fixed_str, "root");
strcat(fixed_str, remainder);
- loc_str = fixed_str;
+ return fixed_str;
}
}
+ return NULL;
+}
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Ensure that no path leaks a UCollator.
+ */
+UCollator *
+pg_ucol_open(const char *loc_str)
+{
+ UCollator *collator;
+ UErrorCode status;
+ const char *orig_str = loc_str;
+ char *fixed_str;
+
+ /*
+ * Must never open default collator, because it depends on the environment
+ * and may change at any time. Should not happen, but check here to catch
+ * bugs that might be hard to catch otherwise.
+ *
+ * NB: the default collator is not the same as the collator for the root
+ * locale. The root locale may be specified as the empty string, "und", or
+ * "root". The default collator is opened by passing NULL to ucol_open().
+ */
+ if (loc_str == NULL)
+ elog(ERROR, "opening default collator is not supported");
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
status = U_ZERO_ERROR;
collator = ucol_open(loc_str, &status);
if (U_FAILURE(status))
@@ -456,6 +478,34 @@ pg_ucol_open(const char *loc_str)
return collator;
}
+/*
+ * Wrapper around ucasemap_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Additionally makes sure we get the right options for case folding.
+ */
+static UCaseMap *
+pg_ucasemap_open(const char *loc_str)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ char *fixed_str;
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
+ casemap = ucasemap_open(loc_str, icu_foldcase_options(loc_str), &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+ if (fixed_str != NULL)
+ pfree(fixed_str);
+
+ return casemap;
+}
+
/*
* Create a UCollator with the given locale string and rules.
*
@@ -528,80 +578,76 @@ static size_t
strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToLower, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToLower(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToLower, dest, destsize, src, srclen, locale);
}
static size_t
strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToTitle_default_BI, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToTitle(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToTitle_default_BI, dest, destsize, src, srclen, locale);
}
static size_t
strupper_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToUpper, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToUpper(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToUpper, dest, destsize, src, srclen, locale);
}
static size_t
strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strFoldCase_default, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8FoldCase(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strFoldCase_default, dest, destsize, src, srclen, locale);
}
/*
@@ -829,8 +875,28 @@ icu_from_uchar(char *dest, size_t destsize, const UChar *buff_uchar, int32_t len
}
static int32_t
-icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source, int32_t len_source)
+icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest, size_t destsize,
+ const char *src, ssize_t srclen, pg_locale_t locale)
+{
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(func, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+}
+
+static int32_t
+icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source, int32_t len_source)
{
UErrorCode status;
int32_t len_dest;
@@ -870,10 +936,17 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
UErrorCode *pErrorCode)
+{
+ return u_strFoldCase(dest, destCapacity, src, srcLength,
+ icu_foldcase_options(locale), pErrorCode);
+}
+
+static int32_t
+icu_foldcase_options(const char *locale)
{
uint32 options = U_FOLD_CASE_DEFAULT;
char lang[3];
- UErrorCode status;
+ UErrorCode status = U_ZERO_ERROR;
/*
* Unlike the ICU APIs for lowercasing, titlecasing, and uppercasing, case
@@ -881,7 +954,6 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
* option relevant to Turkic languages 'az' and 'tr'; check for those
* languages to enable the option.
*/
- status = U_ZERO_ERROR;
uloc_getLanguage(locale, lang, 3, &status);
if (U_SUCCESS(status))
{
@@ -893,8 +965,7 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
options = U_FOLD_CASE_EXCLUDE_SPECIAL_I;
}
- return u_strFoldCase(dest, destCapacity, src, srcLength,
- options, pErrorCode);
+ return options;
}
/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 86016b9344e..a4995e046aa 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -21,6 +21,7 @@
#undef U_SHOW_CPLUSPLUS_HEADER_API
#define U_SHOW_CPLUSPLUS_HEADER_API 0
#include <unicode/ucol.h>
+#include <unicode/ucasemap.h>
#endif
/* use for libc locale names */
@@ -168,6 +169,7 @@ struct pg_locale_struct
const char *locale;
UCollator *ucol;
locale_t lt;
+ UCaseMap *ucasemap;
} icu;
#endif
};
--
2.47.3
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-12-31 02:36 zengman <[email protected]>
parent: Andreas Karlsson <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: zengman @ 2025-12-31 02:36 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; Jeff Davis <[email protected]>; pgsql-hackers
Hi Andreas,
On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
--
Regards,
Man Zeng
www.openhalo.org
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2025-12-31 15:40 Andreas Karlsson <[email protected]>
parent: zengman <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Karlsson @ 2025-12-31 15:40 UTC (permalink / raw)
To: zengman <[email protected]>; Jeff Davis <[email protected]>; pgsql-hackers
On 12/31/25 3:36 AM, zengman wrote:
> On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
Thanks for trying it out!
> In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
Fixed.
> In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
Fixed.
Andreas
Attachments:
[text/x-patch] v3-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (13.7K, 2-v3-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch)
From d1a9bc4c1cc15333cc44e7fc21364c7289c8bb49 Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 17 Dec 2024 22:47:00 +0100
Subject: [PATCH v3] Use optimized versions of ICU case conversion for UTF-8
Instead of converting to and from UChar when doing case conversions we
use the UTF-8 versions of the functions. This can give a significant
speedup, 30-40%, on short to medium length strings.
The only cost we incur is that we have to allocate a casemap object on
locale initialization for UTF-8 databases but the object is relatively
small and the assumption is that most users will at some point want to
run case conversion functions.
While at it we also remove some duplication in the non-UTF-8 code.
---
src/backend/utils/adt/pg_locale_icu.c | 256 +++++++++++++++++---------
src/include/utils/pg_locale.h | 2 +
2 files changed, 167 insertions(+), 91 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 43d44fe43bd..279429c6c67 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -52,6 +52,7 @@ extern pg_locale_t create_pg_locale_icu(Oid collid, MemoryContext context);
#ifdef USE_ICU
extern UCollator *pg_ucol_open(const char *loc_str);
+static UCaseMap *pg_ucasemap_open(const char *loc_str);
static size_t strlower_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
@@ -111,9 +112,12 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
const UChar *buff_uchar, int32_t len_uchar);
static void icu_set_collation_attributes(UCollator *collator, const char *loc,
UErrorCode *status);
-static int32_t icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source,
- int32_t len_source);
+static int32_t icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest,
+ size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source,
+ int32_t len_source);
static int32_t u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
@@ -140,6 +144,8 @@ tolower_icu(pg_wchar wc, pg_locale_t locale)
return u_tolower(wc);
}
+static int32_t icu_foldcase_options(const char *locale);
+
static const struct collate_methods collate_methods_icu = {
.strncoll = strncoll_icu,
.strnxfrm = strnxfrm_icu,
@@ -278,6 +284,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
const char *icurules = NULL;
UCollator *collator;
locale_t loc = (locale_t) 0;
+ UCaseMap *casemap = NULL;
pg_locale_t result;
if (collid == DEFAULT_COLLATION_OID)
@@ -339,10 +346,14 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
collator = make_icu_collator(iculocstr, icurules);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ casemap = pg_ucasemap_open(iculocstr);
+
result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
result->icu.locale = MemoryContextStrdup(context, iculocstr);
result->icu.ucol = collator;
result->icu.lt = loc;
+ result->icu.ucasemap = casemap;
result->deterministic = deterministic;
result->collate_is_c = false;
result->ctype_is_c = false;
@@ -366,41 +377,18 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
#ifdef USE_ICU
/*
- * Wrapper around ucol_open() to handle API differences for older ICU
- * versions.
- *
- * Ensure that no path leaks a UCollator.
+ * In ICU versions 54 and earlier, "und" is not a recognized spelling of the
+ * root locale. If the first component of the locale is "und", replace with
+ * "root" before opening.
*/
-UCollator *
-pg_ucol_open(const char *loc_str)
+static char *
+fix_icu_locale_str(const char *loc_str)
{
- UCollator *collator;
- UErrorCode status;
- const char *orig_str = loc_str;
- char *fixed_str = NULL;
-
- /*
- * Must never open default collator, because it depends on the environment
- * and may change at any time. Should not happen, but check here to catch
- * bugs that might be hard to catch otherwise.
- *
- * NB: the default collator is not the same as the collator for the root
- * locale. The root locale may be specified as the empty string, "und", or
- * "root". The default collator is opened by passing NULL to ucol_open().
- */
- if (loc_str == NULL)
- elog(ERROR, "opening default collator is not supported");
-
- /*
- * In ICU versions 54 and earlier, "und" is not a recognized spelling of
- * the root locale. If the first component of the locale is "und", replace
- * with "root" before opening.
- */
if (U_ICU_VERSION_MAJOR_NUM < 55)
{
char lang[ULOC_LANG_CAPACITY];
+ UErrorCode status = U_ZERO_ERROR;
- status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
@@ -413,15 +401,49 @@ pg_ucol_open(const char *loc_str)
if (strcmp(lang, "und") == 0)
{
const char *remainder = loc_str + strlen("und");
+ char *fixed_str;
fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
strcpy(fixed_str, "root");
strcat(fixed_str, remainder);
- loc_str = fixed_str;
+ return fixed_str;
}
}
+ return NULL;
+}
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Ensure that no path leaks a UCollator.
+ */
+UCollator *
+pg_ucol_open(const char *loc_str)
+{
+ UCollator *collator;
+ UErrorCode status;
+ const char *orig_str = loc_str;
+ char *fixed_str;
+
+ /*
+ * Must never open default collator, because it depends on the environment
+ * and may change at any time. Should not happen, but check here to catch
+ * bugs that might be hard to catch otherwise.
+ *
+ * NB: the default collator is not the same as the collator for the root
+ * locale. The root locale may be specified as the empty string, "und", or
+ * "root". The default collator is opened by passing NULL to ucol_open().
+ */
+ if (loc_str == NULL)
+ elog(ERROR, "opening default collator is not supported");
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
status = U_ZERO_ERROR;
collator = ucol_open(loc_str, &status);
if (U_FAILURE(status))
@@ -456,6 +478,37 @@ pg_ucol_open(const char *loc_str)
return collator;
}
+/*
+ * Wrapper around ucasemap_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Additionally makes sure we get the right options for case folding.
+ */
+static UCaseMap *
+pg_ucasemap_open(const char *loc_str)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ const char *orig_str = loc_str;
+ char *fixed_str;
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
+ casemap = ucasemap_open(loc_str, icu_foldcase_options(loc_str), &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not open casemap for locale \"%s\": %s",
+ orig_str, u_errorName(status)));
+
+ if (fixed_str != NULL)
+ pfree(fixed_str);
+
+ return casemap;
+}
+
/*
* Create a UCollator with the given locale string and rules.
*
@@ -528,80 +581,76 @@ static size_t
strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToLower, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToLower(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToLower, dest, destsize, src, srclen, locale);
}
static size_t
strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToTitle_default_BI, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToTitle(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToTitle_default_BI, dest, destsize, src, srclen, locale);
}
static size_t
strupper_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToUpper, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToUpper(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strToUpper, dest, destsize, src, srclen, locale);
}
static size_t
strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strFoldCase_default, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+ if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8FoldCase(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ (errmsg("case conversion failed: %s", u_errorName(status))));
+ return needed;
+ }
+ else
+ return icu_convert_case_no_utf8(u_strFoldCase_default, dest, destsize, src, srclen, locale);
}
/*
@@ -829,8 +878,28 @@ icu_from_uchar(char *dest, size_t destsize, const UChar *buff_uchar, int32_t len
}
static int32_t
-icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source, int32_t len_source)
+icu_convert_case_no_utf8(ICU_Convert_Func func, char *dest, size_t destsize,
+ const char *src, ssize_t srclen, pg_locale_t locale)
+{
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(func, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+}
+
+static int32_t
+icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source, int32_t len_source)
{
UErrorCode status;
int32_t len_dest;
@@ -870,10 +939,17 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
UErrorCode *pErrorCode)
+{
+ return u_strFoldCase(dest, destCapacity, src, srcLength,
+ icu_foldcase_options(locale), pErrorCode);
+}
+
+static int32_t
+icu_foldcase_options(const char *locale)
{
uint32 options = U_FOLD_CASE_DEFAULT;
char lang[3];
- UErrorCode status;
+ UErrorCode status = U_ZERO_ERROR;
/*
* Unlike the ICU APIs for lowercasing, titlecasing, and uppercasing, case
@@ -881,7 +957,6 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
* option relevant to Turkic languages 'az' and 'tr'; check for those
* languages to enable the option.
*/
- status = U_ZERO_ERROR;
uloc_getLanguage(locale, lang, 3, &status);
if (U_SUCCESS(status))
{
@@ -893,8 +968,7 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
options = U_FOLD_CASE_EXCLUDE_SPECIAL_I;
}
- return u_strFoldCase(dest, destCapacity, src, srcLength,
- options, pErrorCode);
+ return options;
}
/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 86016b9344e..a4995e046aa 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -21,6 +21,7 @@
#undef U_SHOW_CPLUSPLUS_HEADER_API
#define U_SHOW_CPLUSPLUS_HEADER_API 0
#include <unicode/ucol.h>
+#include <unicode/ucasemap.h>
#endif
/* use for libc locale names */
@@ -168,6 +169,7 @@ struct pg_locale_struct
const char *locale;
UCollator *ucol;
locale_t lt;
+ UCaseMap *ucasemap;
} icu;
#endif
};
--
2.47.3
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-01-03 00:40 Andreas Karlsson <[email protected]>
parent: Andreas Karlsson <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Karlsson @ 2026-01-03 00:40 UTC (permalink / raw)
To: zengman <[email protected]>; Jeff Davis <[email protected]>; pgsql-hackers
Hi,
Here is version 4 of the patch, which uses the fact that we have method
tables to remove one level of indirection. I am not sure the extra lines
of code are worth it, but on the other hand, despite the 40 additional
lines, I find the code easier to read. What do you think?
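To illustrate the method-table idea in isolation, here is a minimal sketch. It is not code from the patch: every `demo_*` name is a hypothetical stand-in, and plain ASCII uppercasing takes the place of the real ICU calls. The point it tries to show is that the encoding check happens once, when the locale is set up, so each later call goes straight through a function pointer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature, loosely modelled on the str*_icu() functions. */
typedef size_t (*demo_case_func) (char *dest, size_t destsize, const char *src);

struct demo_ctype_methods
{
	demo_case_func strupper;
};

struct demo_locale
{
	const struct demo_ctype_methods *ctype;
};

/* ASCII-only stand-in for the real conversion; returns the needed length. */
static size_t
demo_upper_ascii(char *dest, size_t destsize, const char *src)
{
	size_t		len = strlen(src);

	for (size_t i = 0; i < len && i + 1 < destsize; i++)
		dest[i] = (char) toupper((unsigned char) src[i]);
	if (destsize > 0)
		dest[len < destsize ? len : destsize - 1] = '\0';
	return len;
}

/* Stand-in for the generic path (in the patch: conversion via UChar). */
static size_t
demo_strupper_generic(char *dest, size_t destsize, const char *src)
{
	return demo_upper_ascii(dest, destsize, src);
}

/* Stand-in for the UTF-8 fast path (in the patch: ucasemap_utf8ToUpper()). */
static size_t
demo_strupper_utf8(char *dest, size_t destsize, const char *src)
{
	return demo_upper_ascii(dest, destsize, src);
}

static const struct demo_ctype_methods demo_methods_generic = {
	.strupper = demo_strupper_generic,
};

static const struct demo_ctype_methods demo_methods_utf8 = {
	.strupper = demo_strupper_utf8,
};

/* The encoding is checked once here, not on every conversion call. */
static void
demo_locale_init(struct demo_locale *loc, int is_utf8)
{
	assert(loc != NULL);
	loc->ctype = is_utf8 ? &demo_methods_utf8 : &demo_methods_generic;
}

/* Callers dispatch through the table without branching on the encoding. */
static size_t
demo_strupper(const struct demo_locale *loc, char *dest, size_t destsize,
			  const char *src)
{
	return loc->ctype->strupper(dest, destsize, src);
}
```

A caller would run `demo_locale_init(&loc, 1)` once and then call `demo_strupper(&loc, buf, sizeof(buf), "abc")` as often as needed. The cost of this approach is a second, mostly duplicated method table, which is where the extra lines come from.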
Andreas
Attachments:
[text/x-patch] v4-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (15.4K, 2-v4-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch)
From 71909bf31f5ce803507bcbf834885648a8d6d174 Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 17 Dec 2024 22:47:00 +0100
Subject: [PATCH v4] Use optimized versions of ICU case conversion for UTF-8
Instead of converting to and from UChar when doing case conversions we
use the UTF-8 versions of the functions. This can give a significant
speedup, 30-40%, on short to medium length strings.
The only cost we incur is that we have to allocate a casemap object on
locale initialization for UTF-8 databases but the object is relatively
small and the assumption is that most users will at some point want to
run case conversion functions.
While at it we also remove some duplication in the non-UTF-8 code.
---
src/backend/utils/adt/pg_locale_icu.c | 300 ++++++++++++++++++--------
src/include/utils/pg_locale.h | 2 +
2 files changed, 208 insertions(+), 94 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index de80642f9dc..9f283d4a334 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -52,6 +52,7 @@ extern pg_locale_t create_pg_locale_icu(Oid collid, MemoryContext context);
#ifdef USE_ICU
extern UCollator *pg_ucol_open(const char *loc_str);
+static UCaseMap *pg_ucasemap_open(const char *loc_str);
static size_t strlower_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
@@ -61,6 +62,14 @@ static size_t strupper_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
static size_t strfold_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
+static size_t strlower_icu_utf8(char *dest, size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static size_t strtitle_icu_utf8(char *dest, size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static size_t strupper_icu_utf8(char *dest, size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static size_t strfold_icu_utf8(char *dest, size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
static size_t downcase_ident_icu(char *dst, size_t dstsize, const char *src,
ssize_t srclen, pg_locale_t locale);
static int strncoll_icu(const char *arg1, ssize_t len1,
@@ -111,9 +120,12 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
const UChar *buff_uchar, int32_t len_uchar);
static void icu_set_collation_attributes(UCollator *collator, const char *loc,
UErrorCode *status);
-static int32_t icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source,
- int32_t len_source);
+static int32_t icu_convert_case(ICU_Convert_Func func, char *dest,
+ size_t destsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
+static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source,
+ int32_t len_source);
static int32_t u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
@@ -140,6 +152,8 @@ tolower_icu(pg_wchar wc, pg_locale_t locale)
return u_tolower(wc);
}
+static int32_t icu_foldcase_options(const char *locale);
+
static const struct collate_methods collate_methods_icu = {
.strncoll = strncoll_icu,
.strnxfrm = strnxfrm_icu,
@@ -245,6 +259,27 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_tolower = tolower_icu,
};
+static const struct ctype_methods ctype_methods_icu_utf8 = {
+ .strlower = strlower_icu_utf8,
+ .strtitle = strtitle_icu_utf8,
+ .strupper = strupper_icu_utf8,
+ .strfold = strfold_icu_utf8,
+ .downcase_ident = downcase_ident_icu,
+ .wc_isdigit = wc_isdigit_icu,
+ .wc_isalpha = wc_isalpha_icu,
+ .wc_isalnum = wc_isalnum_icu,
+ .wc_isupper = wc_isupper_icu,
+ .wc_islower = wc_islower_icu,
+ .wc_isgraph = wc_isgraph_icu,
+ .wc_isprint = wc_isprint_icu,
+ .wc_ispunct = wc_ispunct_icu,
+ .wc_isspace = wc_isspace_icu,
+ .wc_isxdigit = wc_isxdigit_icu,
+ .wc_iscased = wc_iscased_icu,
+ .wc_toupper = toupper_icu,
+ .wc_tolower = tolower_icu,
+};
+
/*
* ICU still depends on libc for compatibility with certain historical
* behavior for single-byte encodings. See downcase_ident_icu().
@@ -347,10 +382,16 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
result->collate_is_c = false;
result->ctype_is_c = false;
if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ result->icu.ucasemap = pg_ucasemap_open(iculocstr);
result->collate = &collate_methods_icu_utf8;
+ result->ctype = &ctype_methods_icu_utf8;
+ }
else
+ {
result->collate = &collate_methods_icu;
- result->ctype = &ctype_methods_icu;
+ result->ctype = &ctype_methods_icu;
+ }
return result;
#else
@@ -366,41 +407,18 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
#ifdef USE_ICU
/*
- * Wrapper around ucol_open() to handle API differences for older ICU
- * versions.
- *
- * Ensure that no path leaks a UCollator.
+ * In ICU versions 54 and earlier, "und" is not a recognized spelling of the
+ * root locale. If the first component of the locale is "und", replace with
+ * "root" before opening.
*/
-UCollator *
-pg_ucol_open(const char *loc_str)
+static char *
+fix_icu_locale_str(const char *loc_str)
{
- UCollator *collator;
- UErrorCode status;
- const char *orig_str = loc_str;
- char *fixed_str = NULL;
-
- /*
- * Must never open default collator, because it depends on the environment
- * and may change at any time. Should not happen, but check here to catch
- * bugs that might be hard to catch otherwise.
- *
- * NB: the default collator is not the same as the collator for the root
- * locale. The root locale may be specified as the empty string, "und", or
- * "root". The default collator is opened by passing NULL to ucol_open().
- */
- if (loc_str == NULL)
- elog(ERROR, "opening default collator is not supported");
-
- /*
- * In ICU versions 54 and earlier, "und" is not a recognized spelling of
- * the root locale. If the first component of the locale is "und", replace
- * with "root" before opening.
- */
if (U_ICU_VERSION_MAJOR_NUM < 55)
{
char lang[ULOC_LANG_CAPACITY];
+ UErrorCode status = U_ZERO_ERROR;
- status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
@@ -413,15 +431,49 @@ pg_ucol_open(const char *loc_str)
if (strcmp(lang, "und") == 0)
{
const char *remainder = loc_str + strlen("und");
+ char *fixed_str;
fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
strcpy(fixed_str, "root");
strcat(fixed_str, remainder);
- loc_str = fixed_str;
+ return fixed_str;
}
}
+ return NULL;
+}
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Ensure that no path leaks a UCollator.
+ */
+UCollator *
+pg_ucol_open(const char *loc_str)
+{
+ UCollator *collator;
+ UErrorCode status;
+ const char *orig_str = loc_str;
+ char *fixed_str;
+
+ /*
+ * Must never open default collator, because it depends on the environment
+ * and may change at any time. Should not happen, but check here to catch
+ * bugs that might be hard to catch otherwise.
+ *
+ * NB: the default collator is not the same as the collator for the root
+ * locale. The root locale may be specified as the empty string, "und", or
+ * "root". The default collator is opened by passing NULL to ucol_open().
+ */
+ if (loc_str == NULL)
+ elog(ERROR, "opening default collator is not supported");
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
status = U_ZERO_ERROR;
collator = ucol_open(loc_str, &status);
if (U_FAILURE(status))
@@ -456,6 +508,37 @@ pg_ucol_open(const char *loc_str)
return collator;
}
+/*
+ * Wrapper around ucasemap_open() to handle API differences for older ICU
+ * versions.
+ *
+ * Additionally makes sure we get the right options for case folding.
+ */
+static UCaseMap *
+pg_ucasemap_open(const char *loc_str)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ UCaseMap *casemap;
+ const char *orig_str = loc_str;
+ char *fixed_str;
+
+ fixed_str = fix_icu_locale_str(loc_str);
+ if (fixed_str)
+ loc_str = fixed_str;
+
+ casemap = ucasemap_open(loc_str, icu_foldcase_options(loc_str), &status);
+ if (U_FAILURE(status))
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not open casemap for locale \"%s\": %s",
+ orig_str, u_errorName(status)));
+
+ if (fixed_str != NULL)
+ pfree(fixed_str);
+
+ return casemap;
+}
+
/*
* Create a UCollator with the given locale string and rules.
*
@@ -528,80 +611,84 @@ static size_t
strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToLower, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ return icu_convert_case(u_strToLower, dest, destsize, src, srclen, locale);
}
static size_t
strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToTitle_default_BI, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ return icu_convert_case(u_strToTitle_default_BI, dest, destsize, src, srclen, locale);
}
static size_t
strupper_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
-
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strToUpper, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
-
- return result_len;
+ return icu_convert_case(u_strToUpper, dest, destsize, src, srclen, locale);
}
static size_t
strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- int32_t len_uchar;
- int32_t len_conv;
- UChar *buff_uchar;
- UChar *buff_conv;
- size_t result_len;
+ return icu_convert_case(u_strFoldCase_default, dest, destsize, src, srclen, locale);
+}
- len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
- len_conv = icu_convert_case(u_strFoldCase_default, locale,
- &buff_conv, buff_uchar, len_uchar);
- result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
- pfree(buff_uchar);
- pfree(buff_conv);
+static size_t
+strlower_icu_utf8(char *dest, size_t destsize, const char *src, ssize_t srclen,
+ pg_locale_t locale)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
- return result_len;
+ needed = ucasemap_utf8ToLower(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ errmsg("case conversion failed: %s", u_errorName(status)));
+ return needed;
+}
+
+static size_t
+strtitle_icu_utf8(char *dest, size_t destsize, const char *src, ssize_t srclen,
+ pg_locale_t locale)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
+
+ needed = ucasemap_utf8ToTitle(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ errmsg("case conversion failed: %s", u_errorName(status)));
+ return needed;
+}
+
+static size_t
+strupper_icu_utf8(char *dest, size_t destsize, const char *src, ssize_t srclen,
+ pg_locale_t locale)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
+
+ needed = ucasemap_utf8ToUpper(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ errmsg("case conversion failed: %s", u_errorName(status)));
+ return needed;
+}
+
+static size_t
+strfold_icu_utf8(char *dest, size_t destsize, const char *src, ssize_t srclen,
+ pg_locale_t locale)
+{
+ UErrorCode status = U_ZERO_ERROR;
+ int32_t needed;
+
+ needed = ucasemap_utf8FoldCase(locale->icu.ucasemap, dest, destsize, src, srclen, &status);
+ if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+ ereport(ERROR,
+ errmsg("case conversion failed: %s", u_errorName(status)));
+ return needed;
}
/*
@@ -829,8 +916,28 @@ icu_from_uchar(char *dest, size_t destsize, const UChar *buff_uchar, int32_t len
}
static int32_t
-icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
- UChar **buff_dest, UChar *buff_source, int32_t len_source)
+icu_convert_case(ICU_Convert_Func func, char *dest, size_t destsize,
+ const char *src, ssize_t srclen, pg_locale_t locale)
+{
+ int32_t len_uchar;
+ int32_t len_conv;
+ UChar *buff_uchar;
+ UChar *buff_conv;
+ size_t result_len;
+
+ len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+ len_conv = icu_convert_case_uchar(func, locale, &buff_conv,
+ buff_uchar, len_uchar);
+ result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+ pfree(buff_uchar);
+ pfree(buff_conv);
+
+ return result_len;
+}
+
+static int32_t
+icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+ UChar **buff_dest, UChar *buff_source, int32_t len_source)
{
UErrorCode status;
int32_t len_dest;
@@ -870,10 +977,17 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale,
UErrorCode *pErrorCode)
+{
+ return u_strFoldCase(dest, destCapacity, src, srcLength,
+ icu_foldcase_options(locale), pErrorCode);
+}
+
+static int32_t
+icu_foldcase_options(const char *locale)
{
uint32 options = U_FOLD_CASE_DEFAULT;
char lang[3];
- UErrorCode status;
+ UErrorCode status = U_ZERO_ERROR;
/*
* Unlike the ICU APIs for lowercasing, titlecasing, and uppercasing, case
@@ -881,7 +995,6 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
* option relevant to Turkic languages 'az' and 'tr'; check for those
* languages to enable the option.
*/
- status = U_ZERO_ERROR;
uloc_getLanguage(locale, lang, 3, &status);
if (U_SUCCESS(status))
{
@@ -893,8 +1006,7 @@ u_strFoldCase_default(UChar *dest, int32_t destCapacity,
options = U_FOLD_CASE_EXCLUDE_SPECIAL_I;
}
- return u_strFoldCase(dest, destCapacity, src, srcLength,
- options, pErrorCode);
+ return options;
}
/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index b1ee5fb0ef5..465f170ba79 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -21,6 +21,7 @@
#undef U_SHOW_CPLUSPLUS_HEADER_API
#define U_SHOW_CPLUSPLUS_HEADER_API 0
#include <unicode/ucol.h>
+#include <unicode/ucasemap.h>
#endif
/* use for libc locale names */
@@ -168,6 +169,7 @@ struct pg_locale_struct
const char *locale;
UCollator *ucol;
locale_t lt;
+ UCaseMap *ucasemap;
} icu;
#endif
};
--
2.47.3
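The four *_icu_utf8 helpers in the patch above share one contract: each returns the number of bytes the result needs and does not raise an error on U_BUFFER_OVERFLOW_ERROR, presumably so that the caller can retry with a larger buffer. The following is a minimal sketch of that caller-side pattern and is not code from the patch. demo_upper() and demo_upper_alloc() are hypothetical names, and an ASCII-only toupper() loop stands in for the ICU call.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a converter such as strupper_icu_utf8(): it
 * writes at most destsize bytes and always returns the number of bytes the
 * full result needs (excluding the terminator), even if dest is too small.
 */
static size_t
demo_upper(char *dest, size_t destsize, const char *src, size_t srclen)
{
	for (size_t i = 0; i < srclen && i < destsize; i++)
		dest[i] = (char) toupper((unsigned char) src[i]);
	if (srclen < destsize)
		dest[srclen] = '\0';	/* terminate only when there is room */
	return srclen;
}

/* Caller side: try once, then grow the buffer if the result did not fit. */
static char *
demo_upper_alloc(const char *src)
{
	size_t		srclen = strlen(src);
	size_t		destsize = 4;	/* deliberately small first guess */
	char	   *dest = malloc(destsize);
	size_t		needed;

	if (dest == NULL)
		return NULL;

	needed = demo_upper(dest, destsize, src, srclen);
	if (needed + 1 > destsize)
	{
		/* first attempt was truncated; retry with the exact size */
		char	   *bigger = realloc(dest, needed + 1);

		if (bigger == NULL)
		{
			free(dest);
			return NULL;
		}
		dest = bigger;
		destsize = needed + 1;
		needed = demo_upper(dest, destsize, src, srclen);
	}
	assert(needed < destsize);
	return dest;
}
```

Calling demo_upper_alloc() with an 11-byte input exercises the retry branch, since the first 4-byte buffer is too small; the caller owns the returned buffer and must free() it.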
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-01-03 03:05 zengman <[email protected]>
parent: Andreas Karlsson <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: zengman @ 2026-01-03 03:05 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; Jeff Davis <[email protected]>; pgsql-hackers
> Here is a version 4 of the patch which uses the fact that we have method
> tables to remove one level of indirection. I am not sure the extra lines
> of codes are worth it but on the other hand despite 40 more lines the
> code became easier to read to me. What do you think?
I don't have any major objections, but I noticed a few minor details that might need a bit more tweaking.
`signficant` -> `significant`
`realtively` -> `relatively`
`if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` -> `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)`
--
Regards,
Man Zeng
www.openhalo.org
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-03-12 04:00 Alexander Lakhin <[email protected]>
parent: zengman <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Alexander Lakhin @ 2026-03-12 04:00 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; Andreas Karlsson <[email protected]>; zengman <[email protected]>; pgsql-hackers
Hello Jeff,
07.01.2026 00:10, Jeff Davis wrote:
> Committed, thank you!
I've discovered that starting from c4ff35f10, the following query:
CREATE COLLATION c (provider = icu, locale = 'icu_something');
makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00
sp 0x7ffd386d45a8
...
Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in frame
#0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/postgres+0x12322ef) (BuildId:
e441a9634858193e7358e5901e7948606ff5b1b1)
This frame has 2 object(s):
[48, 52) 'status' (line 993)
[64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows this variable
I use a build made with:
CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -static-libasan" ./configure --with-icu ...
Could you please have a look?
Best regards,
Alexander
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-01 00:46 Andreas Karlsson <[email protected]>
parent: Alexander Lakhin <[email protected]>
0 siblings, 2 replies; 17+ messages in thread
From: Andreas Karlsson @ 2026-04-01 00:46 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; Jeff Davis <[email protected]>; zengman <[email protected]>; pgsql-hackers
On 3/12/26 5:00 AM, Alexander Lakhin wrote:
> I've discovered that starting from c4ff35f10, the following query:
> CREATE COLLATION c (provider = icu, locale = 'icu_something');
>
> makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
> ==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address
> 0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00 sp 0x7ffd386d45a8
> ...
> Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in
> frame
> #0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/
> postgres+0x12322ef) (BuildId: e441a9634858193e7358e5901e7948606ff5b1b1)
>
> This frame has 2 object(s):
> [48, 52) 'status' (line 993)
> [64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows
> this variable
>
> I use a build made with:
> CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -
> static-libasan" ./configure --with-icu ...
>
> Could you please have a look?
Thanks for finding this!
Interestingly, this bug seems like it would have been there even before my
patch, but maybe something I did when moving code around made it
possible or easier to trigger. As far as I can tell the issue is that
uloc_getLanguage(locale, lang, 3, &status);
will populate lang with a string which is not zero terminated if the
language is 3 or more characters, e.g. "und". And for some reason which
I am not entirely sure of, strcmp("tr", {'u','n','d'}) can cause an overflow.
Maybe due to some optimization?
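To make the failure mode concrete, here is a small sketch that does not call ICU. demo_get_language() is a hypothetical stand-in that follows the termination rule described above (copy what fits, add a NUL only if there is room left), and demo_is_terminated() checks whether the buffer is a valid C string. Whatever the exact trigger was in the ASan run, strcmp() requires NUL-terminated arguments, so passing the unterminated 3-byte buffer is undefined behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the termination rule described above: copy up
 * to `capacity` bytes of `lang`, and add a NUL only if there is room left.
 * Returns the full length of `lang`.
 */
static size_t
demo_get_language(const char *lang, char *dest, size_t capacity)
{
	size_t		len = strlen(lang);
	size_t		ncopy = len < capacity ? len : capacity;

	memcpy(dest, lang, ncopy);
	if (len < capacity)
		dest[len] = '\0';
	return len;
}

/* True if buf[0..capacity) contains a NUL, i.e. is a valid C string. */
static bool
demo_is_terminated(const char *buf, size_t capacity)
{
	return memchr(buf, '\0', capacity) != NULL;
}
```

With a 3-byte buffer, "tr" comes back terminated and is safe to hand to strcmp(), while "und" fills all three bytes and leaves no terminator.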
My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for the
language like we do in fix_icu_locale_str() instead of trying to be
clever. An alternative would be to use strncmp("tr", lang, 3) but that
seems too clever for my taste in something which is not performance
critical. A third option would be to check for
U_STRING_NOT_TERMINATED_WARNING but I think that would just be
unnecessarily convoluted code.
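For comparison, the first two options can be sketched side by side. This is an illustration only: the function names are hypothetical, DEMO_LANG_CAPACITY is a local constant assumed to mirror ICU's ULOC_LANG_CAPACITY, and the "tr"/"az" checks follow the Turkic-language comment in foldcase_options().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed to mirror ULOC_LANG_CAPACITY; check unicode/uloc.h for the real value. */
#define DEMO_LANG_CAPACITY 12

/*
 * Option 1: the buffer is large enough for any language code plus its
 * terminator, so plain strcmp() is safe.
 */
static bool
demo_is_turkic_terminated(const char *lang)
{
	return strcmp(lang, "tr") == 0 || strcmp(lang, "az") == 0;
}

/*
 * Option 2: keep the 3-byte buffer and bound the comparison, so nothing
 * past lang[2] is ever read even if no terminator is present.
 */
static bool
demo_is_turkic_bounded(const char lang[3])
{
	return strncmp("tr", lang, 3) == 0 || strncmp("az", lang, 3) == 0;
}
```

Both give the same answers for two- and three-letter codes; the difference is that option 2 relies on the reader noticing why the bound is 3, which is the cleverness the proposed fix avoids.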
I have attached my proposed fix.
Andreas
Attachments:
[text/x-patch] v1-0001-Fix-overrun-when-comparing-with-unterminated-ICU-.patch (1.3K, 2-v1-0001-Fix-overrun-when-comparing-with-unterminated-ICU-.patch)
download | inline diff:
From 9d9a13917f53de690d70dcfb62adb1f0c5acad2a Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Wed, 1 Apr 2026 02:39:09 +0200
Subject: [PATCH v1] Fix overrun when comparing with unterminated ICU language
string
uloc_getLanguage() returns an unterminated string when the language
is too long to fit in our buffer, in this case 3 bytes. This could cause
a later strcmp() to read outside the buffer.
Since this is not a performance-critical code path, just increase the buffer
size to ULOC_LANG_CAPACITY to match the code in fix_icu_locale_str()
instead of trying to do something clever.
---
src/backend/utils/adt/pg_locale_icu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 5ad05fcd016..96d66dd4f8a 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -989,10 +989,10 @@ static int32_t
foldcase_options(const char *locale)
{
uint32 options = U_FOLD_CASE_DEFAULT;
- char lang[3];
+ char lang[ULOC_LANG_CAPACITY];
UErrorCode status = U_ZERO_ERROR;
- uloc_getLanguage(locale, lang, 3, &status);
+ uloc_getLanguage(locale, lang, ULOC_LANG_CAPACITY, &status);
if (U_SUCCESS(status))
{
/*
--
2.47.3
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-13 06:35 Andreas Karlsson <[email protected]>
parent: Andreas Karlsson <[email protected]>
1 sibling, 0 replies; 17+ messages in thread
From: Andreas Karlsson @ 2026-04-13 06:35 UTC (permalink / raw)
To: [email protected]; Alexander Lakhin <[email protected]>; Jeff Davis <[email protected]>; zengman <[email protected]>; pgsql-hackers
On 1 April 2026 02:46:23 CEST, Andreas Karlsson <[email protected]> wrote:
>My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for the language like we do in fix_icu_locale_str() instead of trying to be clever. An alternative would be to use strncmp("tr", lang, 3) but that seems too clever for my taste in something which is not performance critical. A third option would be to check for U_STRING_NOT_TERMINATED_WARNING but I think that would just be unnecessarily convoluted code.
>
>I have attached my proposed fix.
Since it is likely I introduced or at least exposed this bug somehow I am adding this to the open items for PG 19.
Andreas
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-13 18:40 Jeff Davis <[email protected]>
parent: Andreas Karlsson <[email protected]>
1 sibling, 1 reply; 17+ messages in thread
From: Jeff Davis @ 2026-04-13 18:40 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; Alexander Lakhin <[email protected]>; zengman <[email protected]>; pgsql-hackers
On Wed, 2026-04-01 at 02:46 +0200, Andreas Karlsson wrote:
> On 3/12/26 5:00 AM, Alexander Lakhin wrote:
> > I've discovered that starting from c4ff35f10, the following query:
> > CREATE COLLATION c (provider = icu, locale = 'icu_something');
> >
> > makes asan detect (maybe dubious, but still..) stack-buffer-
> > overflow:
> > ==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address
>
> My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for
> the
> language like we do in fix_icu_locale_str() instead of trying to be
> clever.
Thank you both!
Committed with minor revisions:
* also check the status code, just to be sure
* backport to 18 where the original code was introduced
Regards,
Jeff Davis
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-14 00:20 Andreas Karlsson <[email protected]>
parent: Jeff Davis <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Karlsson @ 2026-04-14 00:20 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; Alexander Lakhin <[email protected]>; zengman <[email protected]>; pgsql-hackers
On 4/13/26 20:40, Jeff Davis wrote:
> Thank you both!
Thanks!
> Committed with minor revisions:
>
> * also check the status code, just to be sure
If we do that, shouldn't we also do the same at the other
uloc_getLanguage() call sites in initdb.c? Maybe something like the
attached. Also I wonder if maybe other ICU functions have the same risk.
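As background on why the extra comparison is needed at all: in ICU's UErrorCode convention warnings are negative and errors are positive, so a U_FAILURE()-style test alone lets U_STRING_NOT_TERMINATED_WARNING through. The sketch below uses local DEMO_* stand-ins instead of the ICU headers; the numeric values are assumed to match unicode/utypes.h and should be checked there before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Local stand-ins mirroring ICU's UErrorCode convention: warnings are
 * negative, success is zero, errors are positive. The values are assumed
 * to match unicode/utypes.h.
 */
typedef enum
{
	DEMO_STRING_NOT_TERMINATED_WARNING = -124,
	DEMO_ZERO_ERROR = 0,
	DEMO_BUFFER_OVERFLOW_ERROR = 15
} DemoErrorCode;

/* Same shape as ICU's U_FAILURE(): only positive codes count as failures. */
#define DEMO_FAILURE(x) ((x) > DEMO_ZERO_ERROR)

/*
 * The check from the attached patch, inverted: the extracted language is
 * usable only if there was no error and the buffer was terminated.
 */
static bool
demo_lang_usable(DemoErrorCode status)
{
	return !(DEMO_FAILURE(status) ||
			 status == DEMO_STRING_NOT_TERMINATED_WARNING);
}
```

The same reasoning would apply to any other ICU call that fills a caller-supplied char buffer, which is the wider risk raised above.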
Andreas
Attachments:
[text/x-patch] v1-0001-Always-check-for-untermianted-strings-when-callin.patch (1.2K, 2-v1-0001-Always-check-for-untermianted-strings-when-callin.patch)
download | inline diff:
From 86703cae627f2d1a12fecb3a6ab7fbc0f0511330 Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 14 Apr 2026 02:12:15 +0200
Subject: [PATCH v1] Always check for unterminated strings when calling
uloc_getLanguage()
---
src/bin/initdb/initdb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 509f1114ef6..21f25915ab2 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2403,7 +2403,7 @@ icu_validate_locale(const char *loc_str)
/* validate that we can extract the language */
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
pg_fatal("could not get language from locale \"%s\": %s",
loc_str, u_errorName(status));
@@ -2423,7 +2423,7 @@ icu_validate_locale(const char *loc_str)
status = U_ZERO_ERROR;
uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status === U_STRING_NOT_TERMINATED_WARNING)
continue;
if (strcmp(lang, otherlang) == 0)
--
2.43.0
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-14 00:28 Andreas Karlsson <[email protected]>
parent: Andreas Karlsson <[email protected]>
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Karlsson @ 2026-04-14 00:28 UTC (permalink / raw)
To: Jeff Davis <[email protected]>; Alexander Lakhin <[email protected]>; zengman <[email protected]>; pgsql-hackers
On 4/14/26 02:20, Andreas Karlsson wrote:
> If we do that, shouldn't we also do the same at the other
> uloc_getLanguage() call sites in initdb.c? Maybe something like the
> attached. Also I wonder if maybe other ICU functions have the same risk.
Now attached without a stupid typo.
Andreas
Attachments:
[text/x-patch] v2-0001-Always-check-for-untermianted-strings-when-callin.patch (1.2K, 2-v2-0001-Always-check-for-untermianted-strings-when-callin.patch)
download | inline diff:
From 24bb803005d0af0bb371f22d0b6ac20fb50bdc0d Mon Sep 17 00:00:00 2001
From: Andreas Karlsson <[email protected]>
Date: Tue, 14 Apr 2026 02:12:15 +0200
Subject: [PATCH v2] Always check for unterminated strings when calling
uloc_getLanguage()
---
src/bin/initdb/initdb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 509f1114ef6..2ee834f0765 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2403,7 +2403,7 @@ icu_validate_locale(const char *loc_str)
/* validate that we can extract the language */
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
pg_fatal("could not get language from locale \"%s\": %s",
loc_str, u_errorName(status));
@@ -2423,7 +2423,7 @@ icu_validate_locale(const char *loc_str)
status = U_ZERO_ERROR;
uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
continue;
if (strcmp(lang, otherlang) == 0)
--
2.43.0
* Re: Speed up ICU case conversion by using ucasemap_utf8To*()
@ 2026-04-14 21:49 Jeff Davis <[email protected]>
parent: Andreas Karlsson <[email protected]>
0 siblings, 0 replies; 17+ messages in thread
From: Jeff Davis @ 2026-04-14 21:49 UTC (permalink / raw)
To: Andreas Karlsson <[email protected]>; Alexander Lakhin <[email protected]>; zengman <[email protected]>; pgsql-hackers
On Tue, 2026-04-14 at 02:28 +0200, Andreas Karlsson wrote:
> On 4/14/26 02:20, Andreas Karlsson wrote:
> > If we do that, shouldn't we also do the same at the other
> > uloc_getLanguage() call sites in initdb.c? Maybe something like the
> > attached. Also I wonder if maybe other ICU functions have the same risk.
Committed, thank you.
Regards,
Jeff Davis
Thread overview: 17+ messages
2024-12-20 05:20 Speed up ICU case conversion by using ucasemap_utf8To*() Andreas Karlsson <[email protected]>
2024-12-20 19:24 ` Jeff Davis <[email protected]>
2025-12-31 00:18 ` Andreas Karlsson <[email protected]>
2025-12-31 02:36 ` zengman <[email protected]>
2025-12-31 15:40 ` Andreas Karlsson <[email protected]>
2026-01-03 00:40 ` Andreas Karlsson <[email protected]>
2026-01-03 03:05 ` zengman <[email protected]>
2026-03-12 04:00 ` Alexander Lakhin <[email protected]>
2026-04-01 00:46 ` Andreas Karlsson <[email protected]>
2026-04-13 06:35 ` Andreas Karlsson <[email protected]>
2026-04-13 18:40 ` Jeff Davis <[email protected]>
2026-04-14 00:20 ` Andreas Karlsson <[email protected]>
2026-04-14 00:28 ` Andreas Karlsson <[email protected]>
2026-04-14 21:49 ` Jeff Davis <[email protected]>
2025-03-17 06:46 ` vignesh C <[email protected]>
2025-03-29 18:50 ` Andres Freund <[email protected]>
2025-03-30 01:18 ` vignesh C <[email protected]>