Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rADqw-005V17-TW for pgsql-sql@arkaria.postgresql.org; Mon, 04 Dec 2023 18:39:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rADqv-000moq-Ev for pgsql-sql@arkaria.postgresql.org; Mon, 04 Dec 2023 18:39:33 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rADqv-000mog-5I for pgsql-sql@lists.postgresql.org; Mon, 04 Dec 2023 18:39:33 +0000 Received: from mail-ej1-f48.google.com ([209.85.218.48]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rADqs-008nBI-DW for pgsql-sql@lists.postgresql.org; Mon, 04 Dec 2023 18:39:31 +0000 Received: by mail-ej1-f48.google.com with SMTP id a640c23a62f3a-9fa2714e828so643029866b.1 for ; Mon, 04 Dec 2023 10:39:30 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701715168; x=1702319968; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=t+JbndMYywUZk5dAVCyZjNn6PNM1fv25TPRN6uGWNds=; b=Kn5pmsnrUXuaWwed+tJGnHrHp+ZrGDnQcyhUuQwiP/732GUUN3t+vNK//4W26WqBl+ LCn5+3s8Kvgswrpt3XurlPhrawFt1Tj2X3RXzxfvUSJHFM4mK08VQsa5f8ccGJWG7GwW REM/mCv2HHzLREEJiyXvHAXG7OxZaiujEToEycTPCuapmZUKCTraoFZe7tk3gdiv3clZ ctL0VtStlYEvqfbDOuf7T54bh6e/Vng56bnW+V7mWdSpPichWVI4n6AxchOSFdL+hVYB u78a2P7Xuuui7/BrJeRTuy66y7V8sYAzhvCCepqVCkrPLbnwy6wob7EN9PDLRKpOJY7X RQPA== X-Gm-Message-State: AOJu0YyHr9C9U63Gzzn/GqZOy2fi6p3b2MgKts33PKL4sgUzKnClLSBN LmUj2V5A9/QXHFd29akP+U5nt7mUaXylsQ== X-Google-Smtp-Source: AGHT+IGQJ88jkSovAlFH/i7A4h0zwZrqg690prML3T0kIk9Ym8dIf0tHkku0hq+mNdEO1NY/9TAe6Q== X-Received: by 2002:a17:906:3f4f:b0:a02:38a2:4d79 with SMTP id f15-20020a1709063f4f00b00a0238a24d79mr3044523ejj.41.1701715168167; Mon, 04 Dec 2023 10:39:28 -0800 (PST) Received: from mail-wr1-f45.google.com (mail-wr1-f45.google.com. [209.85.221.45]) by smtp.gmail.com with ESMTPSA id uz14-20020a170907118e00b009e5db336137sm5548024ejb.196.2023.12.04.10.39.24 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 04 Dec 2023 10:39:25 -0800 (PST) Received: by mail-wr1-f45.google.com with SMTP id ffacd0b85a97d-33318b866a0so4499021f8f.3 for ; Mon, 04 Dec 2023 10:39:24 -0800 (PST) X-Received: by 2002:adf:f843:0:b0:333:393a:2e8c with SMTP id d3-20020adff843000000b00333393a2e8cmr2044309wrq.9.1701715164157; Mon, 04 Dec 2023 10:39:24 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Steve Midgley Date: Mon, 4 Dec 2023 10:39:12 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Overcoming Initcap Function limitations? To: Greg Sabino Mullane Cc: Bo Guo , pgsql-sql@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000de62bf060bb36f57" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000de62bf060bb36f57 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Dec 4, 2023 at 10:09=E2=80=AFAM Greg Sabino Mullane wrote: > It's not clear exactly what you are trying to achieve, but you can use > Postgres' built-in text searching system to exclude stopwords. For exampl= e: > > CREATE FUNCTION initcap_realword(myword TEXT) > returns TEXT language SQL AS > $$ > SELECT CASE WHEN length(to_tsvector(myword)) < 1 > THEN myword ELSE initcap(myword) END; > $$; > > You could extend that to multi-word strings with a little effort. However= , > knowing that macdonald should be MacDonald requires a lot more intelligen= ce > than is provided by any Postgres built-in system or extension that I know > of. What you are looking at is the field of science known as Natural > Language Processing, which can get very complex very quickly. But for a > Postgres answer, you might combine plpython3u with spacy ( > https://spacy.io/usage/spacy-101). > > Cheers, > Greg > > I've been having some pretty good experiences with "hard" text transformations such as correct capitalization of names like MacDonald using GPT 3.5 Turbo API which is pretty cheap for the volume of data I've been working with.. Seems like Spacy might do similar things, and if it can be run locally, might be much cheaper than a rental API.. Steve --000000000000de62bf060bb36f57 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Mon, Dec 4, 2023 at 10:09=E2=80=AF= AM Greg Sabino Mullane <htamfids@g= mail.com> wrote:
It's not clear exactly what you are trying to = achieve, but you can use Postgres' built-in text searching system to ex= clude stopwords. For example:

CREATE FUNCTION initcap_re= alword(myword TEXT)
=C2=A0 returns TEXT language SQL AS=C2=A0
$$
SELECT CASE WHEN length(to_tsvector(myword)) < 1
=C2=A0 THEN myword=C2=A0ELSE initcap(myword) END;
$$;
<= /div>

You could extend that to multi-word strings with a= little effort. However, knowing that macdonald should be MacDonald require= s a lot more intelligence than is provided by any Postgres built-in system = or extension that I know of. What you are looking at is the field of scienc= e known as Natural Language Processing, which can get very complex very qui= ckly. But for a Postgres answer, you might combine plpython3u with spacy (<= a href=3D"https://spacy.io/usage/spacy-101" target=3D"_blank">https://spacy= .io/usage/spacy-101).

Cheers,
Greg

I've been having some pretty= good experiences with "hard" text transformations such as correc= t capitalization of names like MacDonald using GPT 3.5 Turbo API which is p= retty cheap for the volume of data I've been working with.. Seems like = Spacy might do similar things, and if it can be run locally, might be much = cheaper than a rental API..

Steve
--000000000000de62bf060bb36f57--