Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rAKPQ-0066HP-LN for pgsql-sql@arkaria.postgresql.org; Tue, 05 Dec 2023 01:39:36 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rAKPP-001mZD-6T for pgsql-sql@arkaria.postgresql.org; Tue, 05 Dec 2023 01:39:35 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rAKPO-001mZ5-PY for pgsql-sql@lists.postgresql.org; Tue, 05 Dec 2023 01:39:34 +0000 Received: from mail-wm1-x32e.google.com ([2a00:1450:4864:20::32e]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rAKPL-00A8l4-Dk for pgsql-sql@lists.postgresql.org; Tue, 05 Dec 2023 01:39:33 +0000 Received: by mail-wm1-x32e.google.com with SMTP id 5b1f17b1804b1-40b595bf5d2so54689185e9.2 for ; Mon, 04 Dec 2023 17:39:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gisticinc-com.20230601.gappssmtp.com; s=20230601; t=1701740370; x=1702345170; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=LRTBSnrZT8TyvVZxIF0pOsfAddxe9hBQK8D1xxQq6ZU=; b=Sr5tkzYg3T93GzqPW/S1WmRsBliLeyPTQKpA3p5kLeK1yaS08RHEq6qwGlz2j2LZUK 1UMq272Au7v76fFeNFYwRYNXVME4eDEASD2vkeKW0lVscUmHKT9kWbokDA7osHgbh8S2 Q12hUtEK+lTyYzS/foL4ezjqSXmRkp7wkq2pdLxlFZSm9Bt479+RaF/QYy/6UscNNc2h 5QVTuNplzwZp3jRf+K1SwwTwR2TaPsaa26/FU2aGCWtZp62bCuRNRKvTcnN3Lezk2jNv eV6MH0RmPWbGDfoqWnO5M952AkqT/ccK9KXU5q+WYFttdJG988dE1tgWjw1AnrGFEbKK bYeQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701740370; x=1702345170; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=LRTBSnrZT8TyvVZxIF0pOsfAddxe9hBQK8D1xxQq6ZU=; b=DB4mzzchCzIooGKuhAvlmlbV69Wxf34lCLhu94/VeMDbKyJQAIZMqeKfpVSS2lkKOC CvXhroXPuq+9ytsRw9JN7N/gc/As0cEI8Vd0TmEwdBTbtGi/nuOg/Xu/UYShw4AOSVnA ocw8VMfiPC6FaNe0uHJm0ZyqCsOxXfqVfeSGdMNzxY8xpRtotnSqhuDaHnrKFmRVLGh7 OAhqVg3jJ8E8G3y0j8ZqVJBU9SazRxAZT3P+DL8BMX6UeJIMUUCnHfdtCAuRvXfpy9M8 tO0dp/GpmgkkaUDHtpgVv/RuElrVdl8bCW+VN3pNKXt8JrFLR9GKyaQVfi11dyhrFq8R 6wiQ== X-Gm-Message-State: AOJu0YxmIqCdn1c4FrjC6M2BT+TlO/p8tBLG0orwHPurQQQY3al8+8/P XMN5+PHvJ4TH3feCgMCg6OqhJgB9OnYWHNGgzsFWDw== X-Google-Smtp-Source: AGHT+IHb5AROR/3Sy1gAP87M3nKkOJ39lyx14ZWHxWUp3Tw4HRODKpG+rjGzyQT5PufGLR3lsCZvgiFAjWWT4sGMKpk= X-Received: by 2002:a7b:c7cf:0:b0:40b:5e21:c5e7 with SMTP id z15-20020a7bc7cf000000b0040b5e21c5e7mr14152wmk.181.1701740370139; Mon, 04 Dec 2023 17:39:30 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Bo Guo Date: Mon, 4 Dec 2023 18:39:18 -0700 Message-ID: Subject: Re: Overcoming Initcap Function limitations? To: Steve Midgley Cc: Greg Sabino Mullane , pgsql-sql@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000004334c8060bb94ee1" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000004334c8060bb94ee1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Thank you, Steve and Greg! Your suggestions open up new potentials for me to explore. At this moment, I lean towards normalizing the database column values in upper case, thereby out-sourcing the case-changing responsibility to the front end. I would love to hear from your thoughts on this pattern. Cheers, Bo On Mon, Dec 4, 2023 at 11:39=E2=80=AFAM Steve Midgley = wrote: > > > On Mon, Dec 4, 2023 at 10:09=E2=80=AFAM Greg Sabino Mullane > wrote: > >> It's not clear exactly what you are trying to achieve, but you can use >> Postgres' built-in text searching system to exclude stopwords. For examp= le: >> >> CREATE FUNCTION initcap_realword(myword TEXT) >> returns TEXT language SQL AS >> $$ >> SELECT CASE WHEN length(to_tsvector(myword)) < 1 >> THEN myword ELSE initcap(myword) END; >> $$; >> >> You could extend that to multi-word strings with a little effort. >> However, knowing that macdonald should be MacDonald requires a lot more >> intelligence than is provided by any Postgres built-in system or extensi= on >> that I know of. What you are looking at is the field of science known as >> Natural Language Processing, which can get very complex very quickly. Bu= t >> for a Postgres answer, you might combine plpython3u with spacy ( >> https://spacy.io/usage/spacy-101). >> >> Cheers, >> Greg >> >> I've been having some pretty good experiences with "hard" text > transformations such as correct capitalization of names like MacDonald > using GPT 3.5 Turbo API which is pretty cheap for the volume of data I've > been working with.. Seems like Spacy might do similar things, and if it c= an > be run locally, might be much cheaper than a rental API.. > > Steve > --0000000000004334c8060bb94ee1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Thank you, Steve and Greg!=C2=A0=C2=A0

=
Your suggestions open up new potentials for me to explore.=C2=A0= At this moment, I lean towards normalizing the database column values in u= pper case, thereby out-sourcing the case-changing responsibility to the fro= nt end.=C2=A0 I would love to hear from your thoughts on this pattern.

Cheers,

Bo=C2=A0
=
On Mon= , Dec 4, 2023 at 11:39=E2=80=AFAM Steve Midgley <science@misuse.org> wrote:


On M= on, Dec 4, 2023 at 10:09=E2=80=AFAM Greg Sabino Mullane <htamfids@gmail.com> wrote:<= br>
It's not clear exactly what you are trying to achieve, but you can use= Postgres' built-in text searching system to exclude stopwords. For exa= mple:

CREATE FUNCTION initcap_realword(myword TEXT)
=C2=A0 returns TEXT language SQL AS=C2=A0
$$
SELE= CT CASE WHEN length(to_tsvector(myword)) < 1
=C2=A0 THEN mywor= d=C2=A0ELSE initcap(myword) END;
$$;

You could extend that to multi-word strings with a little effort. However,= knowing that macdonald should be MacDonald requires a lot more intelligenc= e than is provided by any Postgres built-in system or extension that I know= of. What you are looking at is the field of science known as Natural Langu= age Processing, which can get very complex very quickly. But for a Postgres= answer, you might combine plpython3u with spacy (https://spacy.io/usage/spacy-101)= .

Cheers,
Greg

I've been having some pretty good experiences with &= quot;hard" text transformations such as correct capitalization of name= s like MacDonald using GPT 3.5 Turbo API which is pretty cheap for the volu= me of data I've been working with.. Seems like Spacy might do similar t= hings, and if it can be run locally, might be much cheaper than a rental AP= I..

Steve
--0000000000004334c8060bb94ee1--