Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qx8hE-00GIBs-2H for pgsql-sql@arkaria.postgresql.org; Sun, 29 Oct 2023 16:31:28 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1qx8hC-008Mxw-NU for pgsql-sql@arkaria.postgresql.org; Sun, 29 Oct 2023 16:31:26 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qx8hC-008Mxo-56 for pgsql-sql@lists.postgresql.org; Sun, 29 Oct 2023 16:31:26 +0000 Received: from mail-pj1-x102e.google.com ([2607:f8b0:4864:20::102e]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1qx8h9-002uop-12 for pgsql-sql@lists.postgresql.org; Sun, 29 Oct 2023 16:31:24 +0000 Received: by mail-pj1-x102e.google.com with SMTP id 98e67ed59e1d1-2802e5ae23bso862472a91.2 for ; Sun, 29 Oct 2023 09:31:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail-com.20230601.gappssmtp.com; s=20230601; t=1698597082; x=1699201882; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=iyEuA31CAkTsvk5F5mPej9mBkONJkQySqx0esti5c+w=; b=PfIekN1MBEJSkIMVDH6JOJ3DXU1h5QoF/18YrmYgc6hLPg0lNDittzVsQI8k+DKlPU vzn0BaoBodMj3yEViIEv6VPR5O5Zd/LRqa3icLzPT08+bzu1jakEmWZHWHJEkoHyWfLE g2br6ts7pmUfUubrYUY/k9pVRi/iCBfhqiyaxVKMmtSGMnf6T1TdcmZbE+xKhE/6OIIu nn+SjbwVxunQ2h3hi7/F7BkZxs9U3NRancp27I4+6mehK1SxUVT/pDRLjJQ51bC3rHY3 rE4nBNB0Fkt6LUuqgGsIWsm124SRVoxcxtPhG05yc0qG7xtLwlkvm4hHpNaKcZ47mOvW KVIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698597082; x=1699201882; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=iyEuA31CAkTsvk5F5mPej9mBkONJkQySqx0esti5c+w=; b=AeQaG89SdHr8m+uF7VLU4k9AULTXfNobDeLCHCDTvpoXgE4GNBa9K1ESzw+ObvUXkz jTIl6Cu4sT6eoKlzCrnFphQKJANsnEcfbdlkzz1c5HROc8W+FrSbNZZkQ6A2nuTpr+qY heY6gQCRItC//BP/M2S1w1ifQoFNE+1IjJ+TSDHaVDVc+OSI1dGTnU3ZqQvvGxjQppuj mt1afN4ziPXfRgzLFR1f8Rwg/cx8hfjXIVsBcWXHiDI+cRycGLIb9YY4tbz/TlqDNmLV SuHM4p8iXxTXfcrdGUIKPgt4N/TbBkF8AHNz7wAhXR5DE7OVa6EAbTPXRBF8VSzaIHM2 1Lqg== X-Gm-Message-State: AOJu0YwNbO0Wegy2YMwmmj9RuZP4GGK/98lPEZsd9HpWGG5rl+kwl8+g 7bg8F1IajIDWeRgqyxVndTWUgKT2+Yipc5slAPOS/kjB X-Google-Smtp-Source: AGHT+IHMr6Xig/3UQSKNUM9t0e6sOR6WNtocqbMp5tGkzTVPGHDBqBM7xNcT/Rr46V9kLeQZxz4MlpXO8+P/imMxYHk= X-Received: by 2002:a17:90b:3901:b0:27d:b811:2fe4 with SMTP id ob1-20020a17090b390100b0027db8112fe4mr7005033pjb.26.1698597082091; Sun, 29 Oct 2023 09:31:22 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: =?UTF-8?Q?Martin_Norb=C3=A4ck_Olivers?= Date: Sun, 29 Oct 2023 17:31:07 +0100 Message-ID: Subject: Re: full text search and hyphens in uuid To: Steve Midgley Cc: pgsql-sql@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000b1d52d0608dd73ba" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000b1d52d0608dd73ba Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi! Thanks for answering. My use case for doing this is that I have uuids embedded within the text data (it's JSON data actually) and I just index to_tsvector('simple', json_column). And I want to search for the uuids sometimes, and it's not predetermined which json keys contain them. But it does seem like it's not possible to change the to_tsvector lexer, so I guess I will have to extract the uuids when inserting the data and index them separately. Regards, Martin On Sat, Oct 28, 2023 at 5:48=E2=80=AFPM Steve Midgley = wrote: > On Fri, Oct 27, 2023 at 4:49=E2=80=AFAM Martin Norb=C3=A4ck Olivers > wrote: > >> Hi! >> I have a problem with full text search and uuids in the text which >> I index using to_tsvector . I have uuids in my text and most of the time= , >> it works well because they are lexed as words so I can just search for t= he >> parts of the uuid. >> >> The problem is an uuid like this: >> select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361') >> >> Which gives this result >> '-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5 >> 'eb-abcd-47aa57184361':4 >> >> So, I found dict_int and asked it to remove the minus signs >> >> create extension dict_int; >> ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN =3D 12, absval =3D true); >> alter text search configuration simple alter mapping for int, uint with >> intdict >> >> and now I get this result instead: >> '0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5 >> 'eb-abcd-47aa57184361':4 >> >> which is slightly better, but still not good enough because there is no >> token 59eb. It's being split into 59 and eb. >> >> Is there any way to change this behaviour of the tsvector lexer? Do I >> have to write my own tsvector or is there a way to "turn off" integer >> handling in the lexer? >> >> Regards, >> Martin >> >> I don't understand your use case for doing this, but it seems like you > could use something other than ts_vector to break apart your uuids, and > then index them? It seems like ts_vector is primarily used to find things > that are near to other things via their vector signatures (at least that'= s > my understanding). But doing vector component math on segments of a UUID > seems meaningless since the UUID is mostly random? > > So couldn't you break your UUID into separate fields, or barring that int= o > a jsonb or array field that contains the components, and then just index > that computed field? Maybe that could even be achieved in a view, if you > don't want to alter your core table? > > Obviously all this could be insensible, if I'm not following the purpose > of your use of ts_vector.. > Best, > Steve > --=20 Martin Norb=C3=A4ck Olivers IT-konsult, Masara AB Telefon: +46 703 22 70 12 E-post: martin@norpan.org K=C3=A4rrh=C3=B6ksv=C3=A4gen 4 656 72 Skattk=C3=A4rr --000000000000b1d52d0608dd73ba Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi! Thanks for answering.

My use case for doin= g this is that I have uuids embedded within the text data (it's JSON da= ta actually) and I just index to_tsvector('simple', json_column).

And I want to search for the uuids sometimes, and i= t's not predetermined which json keys contain them. But it does seem li= ke it's not possible to change the to_tsvector lexer, so I guess I will= have to extract the uuids when inserting the data and index them separatel= y.

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48=E2=80=AFPM Steve= Midgley <science@misuse.org&g= t; wrote:
On Fri, Oct 27, 2023 at 4:49=E2=80=AFAM Martin = Norb=C3=A4ck Olivers <martin@norpan.org> wrote:
<= blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-l= eft:1px solid rgb(204,204,204);padding-left:1ex">
Hi!
I have a= problem with full text search and uuids in the text which I=C2=A0index usi= ng to_tsvector . I have uuids in my text and most of the=C2=A0time, it work= s well because they are lexed as words so I can just search for the parts o= f the uuid.

The problem is an uuid like this:
select to_tsvector= ('simple','0232710f-8545-59eb-abcd-47aa57184361')
=

Which gives this result
'-59':3 = '-8545':2 '0232710f':1 '47aa57184361':7 'abcd&#= 39;:6 'eb':5 'eb-abcd-47aa57184361':4

<= /div>
So, I found dict_int and asked it to remove the minus signs
=

create extension dict_int;
ALTER TEXT SE= ARCH DICTIONARY intdict (MAXLEN =3D 12, absval =3D true);
alter t= ext search configuration simple alter mapping for int, uint with intdict

=C2=A0and now I get this result instead:
'0232710f':1 '47aa57184361':7 '59':3 '8545&#= 39;:2 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4

which is slightly better, but still not good enough be= cause there is no token 59eb. It's being split into 59 and eb.

I= s there any way to change this behaviour of the tsvector lexer? Do I have t= o write my own tsvector or is there a way to "turn off" integer h= andling in the lexer?

Regards,
Martin

=
I don't unders= tand your use case for doing this, but it seems like you could use somethin= g other than ts_vector to break apart your uuids, and then index them? It s= eems like ts_vector is primarily used to find things that are near to other= things via their vector signatures (at least that's my understanding).= But doing vector component math on segments of a UUID seems meaningless si= nce the UUID is mostly random?

So couldn't you break= your UUID into separate=C2=A0fields, or barring that into a jsonb or array= field that contains the components, and then just index that computed fiel= d? Maybe that could even be achieved in a view, if you don't want to al= ter your core table?=C2=A0

Obviously all this coul= d be insensible, if I'm not following the purpose of your use of ts_vec= tor..
Best,
Steve=C2=A0


--
Martin Norb=C3=A4ck Olivers
IT-konsult, Masara AB
<= div>Telefon: +46 703 22 70 12
K=C3=A4rrh=C3= =B6ksv=C3=A4gen 4
656 72 Skattk=C3=A4rr
--000000000000b1d52d0608dd73ba--