MIME-Version: 1.0
References: 
 <CALoTC6s=QAvj=yw2cY=8t_dyQsByXF_AT8k=z-YXOcgcj3sO=g@mail.gmail.com>
 <CAJexoSK=OyX3Phmckkb4-X+KxSjXVrpNLVgDa4BdH+EHqOEgAg@mail.gmail.com>
In-Reply-To: 
 <CAJexoSK=OyX3Phmckkb4-X+KxSjXVrpNLVgDa4BdH+EHqOEgAg@mail.gmail.com>
From: =?UTF-8?Q?Martin_Norb=C3=A4ck_Olivers?= <martin@norpan.org>
Date: Sun, 29 Oct 2023 17:31:07 +0100
Message-ID: 
 <CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com>
Subject: Re: full text search and hyphens in uuid
To: Steve Midgley <science@misuse.org>
Cc: pgsql-sql@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000b1d52d0608dd73ba"
Archived-At: 
 <https://www.postgresql.org/message-id/CALoTC6sk1_8-r6e-9-t2oAe79yX94iG%3DAOkSH2nHRnxgDrHGFg%40mail.gmail.com>
Precedence: bulk

--000000000000b1d52d0608dd73ba
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi! Thanks for answering.

My use case is that I have uuids embedded within the text data (it's
actually JSON data), and I just index to_tsvector('simple', json_column).

I sometimes want to search for the uuids, and it's not known in advance
which JSON keys contain them. But it seems it's not possible to change how
to_tsvector lexes its input, so I guess I will have to extract the uuids
when inserting the data and index them separately.
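
A rough sketch of what I have in mind, not tested against my real schema.
The table, column, function, trigger and index names (docs, uuids,
extract_uuids, docs_set_uuids, docs_uuids_idx) are all made up for
illustration, and I'm assuming the uuids always appear in the usual
hyphenated 8-4-4-4-12 hex form:

```sql
-- Hypothetical table: json_column holds the documents,
-- uuids holds every uuid found anywhere in them.
CREATE TABLE docs (
    id          bigserial PRIMARY KEY,
    json_column jsonb  NOT NULL,
    uuids       text[] NOT NULL DEFAULT '{}'
);

-- Collect every 8-4-4-4-12 hex string in the given text, lowercased
-- and de-duplicated. The 'g' flag returns all matches, 'i' ignores case.
CREATE FUNCTION extract_uuids(doc text) RETURNS text[]
LANGUAGE sql IMMUTABLE AS $$
    SELECT ARRAY(
        SELECT DISTINCT lower(m[1])
        FROM regexp_matches(
            doc,
            '([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})',
            'gi') AS m
    );
$$;

-- Fill the uuids column whenever a row is inserted or updated.
CREATE FUNCTION docs_set_uuids() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.uuids := extract_uuids(NEW.json_column::text);
    RETURN NEW;
END;
$$;

-- EXECUTE FUNCTION needs PostgreSQL 11 or later
-- (older versions spell it EXECUTE PROCEDURE).
CREATE TRIGGER docs_set_uuids
BEFORE INSERT OR UPDATE ON docs
FOR EACH ROW EXECUTE FUNCTION docs_set_uuids();

-- GIN index so that array containment lookups can use an index.
CREATE INDEX docs_uuids_idx ON docs USING gin (uuids);

-- Look up a whole uuid (lowercase, since extract_uuids lowercases).
SELECT id
FROM docs
WHERE uuids @> ARRAY['0232710f-8545-59eb-abcd-47aa57184361'];
```

This only finds whole uuids, not parts of them, which is fine for my
purposes. The existing to_tsvector index would stay as it is for ordinary
text searches.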

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <science@misuse.org> wrote:

> On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <martin@norpan.org>
> wrote:
>
>> Hi!
>> I have a problem with full text search and uuids in the text which
>> I index using to_tsvector. I have uuids in my text and most of the time,
>> it works well because they are lexed as words so I can just search for the
>> parts of the uuid.
>>
>> The problem is a uuid like this:
>> select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361')
>>
>> Which gives this result
>> '-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> So, I found dict_int and asked it to remove the minus signs
>>
>> create extension dict_int;
>> ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
>> alter text search configuration simple alter mapping for int, uint with
>> intdict;
>>
>> and now I get this result instead:
>> '0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> which is slightly better, but still not good enough because there is no
>> token 59eb. It's being split into 59 and eb.
>>
>> Is there any way to change this behaviour of the tsvector lexer? Do I
>> have to write my own tsvector or is there a way to "turn off" integer
>> handling in the lexer?
>>
>> Regards,
>> Martin
>>
> I don't understand your use case for doing this, but it seems like you
> could use something other than tsvector to break apart your uuids, and
> then index them? It seems like tsvector is primarily used to find things
> that are near to other things via their vector signatures (at least that's
> my understanding). But doing vector component math on segments of a UUID
> seems meaningless since the UUID is mostly random?
>
> So couldn't you break your UUID into separate fields, or barring that into
> a jsonb or array field that contains the components, and then just index
> that computed field? Maybe that could even be achieved in a view, if you
> don't want to alter your core table?
>
> Obviously all this could be nonsensical if I'm not following the purpose
> of your use of tsvector.
> Best,
> Steve
>


-- 
Martin Norbäck Olivers
IT consultant, Masara AB
Phone: +46 703 22 70 12
Email: martin@norpan.org
Kärrhöksvägen 4
656 72 Skattkärr

--000000000000b1d52d0608dd73ba--