MIME-Version: 1.0
From: =?UTF-8?Q?Martin_Norb=C3=A4ck_Olivers?= <martin@norpan.org>
Date: Fri, 27 Oct 2023 13:48:32 +0200
Message-ID: 
 <CALoTC6s=QAvj=yw2cY=8t_dyQsByXF_AT8k=z-YXOcgcj3sO=g@mail.gmail.com>
Subject: full text search and hyphens in uuid
To: pgsql-sql@lists.postgresql.org
Content-Type: multipart/alternative; boundary="0000000000004b53f00608b14558"
Archived-At: 
 <https://www.postgresql.org/message-id/CALoTC6s%3DQAvj%3Dyw2cY%3D8t_dyQsByXF_AT8k%3Dz-YXOcgcj3sO%3Dg%40mail.gmail.com>
Precedence: bulk

--0000000000004b53f00608b14558
Content-Type: text/plain; charset="UTF-8"

Hi!
I have a problem with full text search when the text I index with
to_tsvector contains UUIDs. Most of the time it works well, because the
parts of a UUID are lexed as words, so I can simply search for the parts
of the UUID.

The problem is a UUID like this:
select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361');

which gives this result:
'-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5
'eb-abcd-47aa57184361':4
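
To see which token type the parser assigns to each piece, the built-in
ts_debug function can be run on the same input. This is only a diagnostic
sketch; the column selection is my own choice, and the output is not
reproduced here.

```sql
-- ts_debug reports, for each token, the parser's token type (alias),
-- the raw token text, and the lexemes the dictionaries produce from it.
SELECT alias, token, lexemes
FROM ts_debug('simple', '0232710f-8545-59eb-abcd-47aa57184361');
```

Rows whose alias is int or uint mark the places where the parser treats part
of the UUID as a number instead of a word.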

So I found dict_int and asked it to remove the minus signs:

create extension dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
alter text search configuration simple alter mapping for int, uint with
intdict;

and now I get this result instead:
'0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5
'eb-abcd-47aa57184361':4

which is slightly better, but still not good enough because there is no
token 59eb. It's being split into 59 and eb.

Is there any way to change this behaviour of the tsvector lexer? Do I have
to write my own parser, or is there a way to "turn off" integer handling
in the lexer?
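
One workaround that might be worth trying is to rewrite UUID-shaped
substrings before they reach to_tsvector, so that no group starts with a
minus sign. The sketch below is an untested assumption, not a confirmed fix:
the regular expression and the choice of a space as the separator are mine,
and whether each group then comes out as a single token would need to be
checked.

```sql
-- Replace the hyphens inside UUID-shaped strings with spaces before indexing.
-- Flags: 'g' replaces every match, 'i' matches hex digits case-insensitively.
SELECT to_tsvector(
  'simple',
  regexp_replace(
    '0232710f-8545-59eb-abcd-47aa57184361',
    '([0-9a-f]{8})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{12})',
    '\1 \2 \3 \4 \5',
    'gi'
  )
);
```

The same rewrite would have to be applied to the search terms as well, so
that queries and indexed text are tokenized consistently.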

Regards,
Martin

--0000000000004b53f00608b14558--