Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rAMns-006JYh-FN for pgsql-sql@arkaria.postgresql.org; Tue, 05 Dec 2023 04:13:00 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rAMno-0028wv-Pu for pgsql-sql@arkaria.postgresql.org; Tue, 05 Dec 2023 04:12:56 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rAMno-0028wm-Ap for pgsql-sql@lists.postgresql.org; Tue, 05 Dec 2023 04:12:56 +0000 Received: from mail-lf1-f43.google.com ([209.85.167.43]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rAMnh-008r6m-4Q for pgsql-sql@lists.postgresql.org; Tue, 05 Dec 2023 04:12:54 +0000 Received: by mail-lf1-f43.google.com with SMTP id 2adb3069b0e04-50be3eed85aso3884476e87.2 for ; Mon, 04 Dec 2023 20:12:48 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701749567; x=1702354367; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=lU54vIu6EqCRwh87r1pGtsJ+XYfMEyzCNj9SmF+HhVE=; b=e78jLD7uqmPVIGRgTiKzutGajQ4w6n2DUgRFSFzezLYUenY8pI2orvxZJOm9zDoSd5 sg31CGlIkckBJadLl9YeXESOqby9YtCr7N5Wa/G4uA6bU7u6FZt8jz0AVJjDk1zDs6sq ssDfZXnlHkbqqDN2hAJ0J8Kh1G+KVoTlHcg4BZbwZsMb5H/6yZIakXZHl1mNudVb3YOB H3DjgctjExclfmY9hQnn78KOX+zbiU3jxopn4LuSqLnegn6CHCsphxTbhvf/JpCOHm0m WFQ9miNOYiGptG/+h0Db9ZXsnJsFw60IG+eeE93RFlCHVT/+k86sDnGEzOBc4OgazDCD OxFw== X-Gm-Message-State: AOJu0YxBfzsskzbImUxLg8cgaK/c4a8JRE4pmRTsQz8TYw50b1P2o+UR /nlnjUYAM6oI846aCk9eghB8R6TjxzTKrA== X-Google-Smtp-Source: AGHT+IGjmFYxYLZAQ7gocgpdlWhUVfph+VfNn9ZhyZtBf+emV9ZUGEMvCbX5A+4yPX2n2iw96qtqLw== X-Received: by 2002:a05:6512:2250:b0:50c:a39:ee2a with SMTP id i16-20020a056512225000b0050c0a39ee2amr119265lfu.55.1701749566617; Mon, 04 Dec 2023 20:12:46 -0800 (PST) Received: from mail-wr1-f52.google.com (mail-wr1-f52.google.com. [209.85.221.52]) by smtp.gmail.com with ESMTPSA id g14-20020a50ee0e000000b0054ccac03945sm519211eds.12.2023.12.04.20.12.46 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 04 Dec 2023 20:12:46 -0800 (PST) Received: by mail-wr1-f52.google.com with SMTP id ffacd0b85a97d-332c46d5988so4072662f8f.1 for ; Mon, 04 Dec 2023 20:12:46 -0800 (PST) X-Received: by 2002:adf:fa0c:0:b0:333:2fd2:5d1a with SMTP id m12-20020adffa0c000000b003332fd25d1amr3844255wrr.76.1701749565962; Mon, 04 Dec 2023 20:12:45 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Steve Midgley Date: Mon, 4 Dec 2023 20:12:34 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Overcoming Initcap Function limitations? To: Bo Guo Cc: Greg Sabino Mullane , pgsql-sql Content-Type: multipart/alternative; boundary="0000000000006044c1060bbb72c9" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000006044c1060bbb72c9 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Dec 4, 2023, 5:39 PM Bo Guo wrote: > Thank you, Steve and Greg! > > Your suggestions open up new potentials for me to explore. At this > moment, I lean towards normalizing the database column values in upper > case, thereby out-sourcing the case-changing responsibility to the front > end. I would love to hear from your thoughts on this pattern. > > Cheers, > > Bo > > On Mon, Dec 4, 2023 at 11:39=E2=80=AFAM Steve Midgley wrote: > >> >> >> On Mon, Dec 4, 2023 at 10:09=E2=80=AFAM Greg Sabino Mullane >> wrote: >> >>> It's not clear exactly what you are trying to achieve, but you can use >>> Postgres' built-in text searching system to exclude stopwords. For exam= ple: >>> >>> CREATE FUNCTION initcap_realword(myword TEXT) >>> returns TEXT language SQL AS >>> $$ >>> SELECT CASE WHEN length(to_tsvector(myword)) < 1 >>> THEN myword ELSE initcap(myword) END; >>> $$; >>> >>> You could extend that to multi-word strings with a little effort. >>> However, knowing that macdonald should be MacDonald requires a lot more >>> intelligence than is provided by any Postgres built-in system or extens= ion >>> that I know of. What you are looking at is the field of science known a= s >>> Natural Language Processing, which can get very complex very quickly. B= ut >>> for a Postgres answer, you might combine plpython3u with spacy ( >>> https://spacy.io/usage/spacy-101). >>> >>> Cheers, >>> Greg >>> >>> I've been having some pretty good experiences with "hard" text >> transformations such as correct capitalization of names like MacDonald >> using GPT 3.5 Turbo API which is pretty cheap for the volume of data I'v= e >> been working with.. Seems like Spacy might do similar things, and if it = can >> be run locally, might be much cheaper than a rental API.. >> >> Steve >> > It really depends on the use case. If your users are happy with all uppercase, that seems like a great solution: fast, cheap, and reliable! Note that this list asks posters to add their reply comments at the bottom of each message, rather than the type which is more typical.. Welcome! Steve > --0000000000006044c1060bbb72c9 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Mon, Dec 4, 2023, 5:39 PM Bo Guo <bo.guo@gisticinc.com> wrote:
Thank you, Steve and Greg!= =C2=A0=C2=A0

Your suggestions open up new potentia= ls for me to explore.=C2=A0 At this moment, I lean towards normalizing the = database column values in upper case, thereby out-sourcing the case-changin= g responsibility to the front end.=C2=A0 I would love to hear from your tho= ughts on this pattern.

Cheers,

Bo=C2=A0

On Mon, Dec 4, 2023 at 11:39=E2=80=AFAM Steve Midgley = <science@misuse.org> wrote:


On Mon, Dec 4, = 2023 at 10:09=E2=80=AFAM Greg Sabino Mullane <htamfids@gmail.com>= wrote:
It's not clear exactly what you are trying to achieve, but you= can use Postgres' built-in text searching system to exclude stopwords.= For example:

CREATE FUNCTION initcap_realword(myword TE= XT)
=C2=A0 returns TEXT language SQL AS=C2=A0
$$
<= div>SELECT CASE WHEN length(to_tsvector(myword)) < 1
=C2=A0 TH= EN myword=C2=A0ELSE initcap(myword) END;
$$;

You could extend that to multi-word strings with a little effort. = However, knowing that macdonald should be MacDonald requires a lot more int= elligence than is provided by any Postgres built-in system or extension tha= t I know of. What you are looking at is the field of science known as Natur= al Language Processing, which can get very complex very quickly. But for a = Postgres answer, you might combine plpython3u with spacy (https://sp= acy.io/usage/spacy-101).

Cheers,
Gre= g

I've been having some pre= tty good experiences with "hard" text transformations such as cor= rect capitalization of names like MacDonald using GPT 3.5 Turbo API which i= s pretty cheap for the volume of data I've been working with.. Seems li= ke Spacy might do similar things, and if it can be run locally, might be mu= ch cheaper than a rental API..

Steve
=
<= br>
It really depends on the use case. If your users= are happy with all uppercase, that seems like a great solution: fast, chea= p, and reliable!=C2=A0

N= ote that this list asks posters to add their reply comments at the bottom o= f each message, rather than the type which is more typical.. Welcome!=C2=A0=

Steve=C2=A0
--0000000000006044c1060bbb72c9--