full text search and hyphens in uuid

public inbox for [email protected]  

3+ messages / 2 participants

* full text search and hyphens in uuid
@ 2023-10-27 11:48 Martin Norbäck Olivers <[email protected]>
  2023-10-28 02:05 ` Re: full text search and hyphens in uuid Tom Lane <[email protected]>
  2023-10-29 16:31 ` Re: full text search and hyphens in uuid Martin Norbäck Olivers <[email protected]>
  0 siblings, 2 replies; 3+ messages in thread

From: Martin Norbäck Olivers @ 2023-10-27 11:48 UTC
  To: [email protected]

Hi!
I have a problem with full text search and UUIDs in the text that I index
using to_tsvector. Most of the time it works well, because the UUIDs are
lexed as words and I can simply search for the parts of a UUID.

The problem is a UUID like this:
select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361');

which gives this result:
'-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5
'eb-abcd-47aa57184361':4

So I found dict_int and asked it to remove the minus signs:

create extension dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
alter text search configuration simple alter mapping for int, uint with intdict;

And now I get this result instead:
'0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5
'eb-abcd-47aa57184361':4

which is slightly better, but still not good enough because there is no
token 59eb. It's being split into 59 and eb.

Is there any way to change this behaviour of the tsvector lexer? Do I have
to write my own tsvector or is there a way to "turn off" integer handling
in the lexer?

Regards,
Martin


* Re: full text search and hyphens in uuid
  2023-10-27 11:48 full text search and hyphens in uuid Martin Norbäck Olivers <[email protected]>
@ 2023-10-28 02:05 ` Tom Lane <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Tom Lane @ 2023-10-28 02:05 UTC
  To: Martin Norbäck Olivers <[email protected]>; +Cc: [email protected]

Martin Norbäck Olivers <[email protected]> writes:
> Is there any way to change this behaviour of the tsvector lexer? Do I have
> to write my own tsvector or is there a way to "turn off" integer handling
> in the lexer?

Sadly, no, you'd have to write your own lexer.  This is a weak spot
in text search configurability for sure.  But I lack any ideas about
how to make it better.

			regards, tom lane

* Re: full text search and hyphens in uuid
  2023-10-27 11:48 full text search and hyphens in uuid Martin Norbäck Olivers <[email protected]>
@ 2023-10-29 16:31 ` Martin Norbäck Olivers <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Martin Norbäck Olivers @ 2023-10-29 16:31 UTC
  To: Steve Midgley <[email protected]>; +Cc: [email protected]

Hi! Thanks for answering.

My use case for doing this is that I have UUIDs embedded within the text
data (it's JSON data actually) and I just index to_tsvector('simple',
json_column).

I sometimes want to search for the UUIDs, and it's not predetermined
which JSON keys contain them. But it does seem that it's not possible to
change the to_tsvector lexer, so I guess I will have to extract the UUIDs
when inserting the data and index them separately.

Regards,
Martin
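
A minimal sketch of the "extract the UUIDs when inserting" idea, assuming the
extraction happens in application code before the row is written. The names
UUID_RE, extract_uuids and uuid_parts are hypothetical and do not come from
the thread or from PostgreSQL. Scanning the serialized JSON text sidesteps the
question of which keys hold UUIDs, and splitting on hyphens yields the parts
the lexer did not produce, such as 59eb.

```python
import re

# Assumption: UUIDs appear in the canonical 8-4-4-4-12 hexadecimal layout.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def extract_uuids(text):
    """Return distinct UUIDs found in text, lowercased, in first-seen order."""
    seen = {}
    for match in UUID_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


def uuid_parts(uuid_text):
    """Split one UUID into its five hyphen-separated groups."""
    return uuid_text.split("-")


# The same UUID appears twice, once in upper case, under arbitrary keys.
sample = (
    '{"ref": "0232710f-8545-59eb-abcd-47aa57184361", '
    '"note": "see 0232710F-8545-59EB-ABCD-47AA57184361"}'
)
uuids = extract_uuids(sample)
parts = uuid_parts(uuids[0])
```

The full UUIDs and their parts could then be stored in a separate column and
indexed independently of the tsvector; how they are stored is left open here,
since the thread does not specify it.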

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <[email protected]> wrote:

> On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <[email protected]>
> wrote:
>
>> [...]
>
> I don't understand your use case for doing this, but it seems like you
> could use something other than tsvector to break apart your UUIDs, and
> then index them? It seems like tsvector is primarily used to find things
> that are near to other things via their vector signatures (at least that's
> my understanding). But doing vector component math on segments of a UUID
> seems meaningless since the UUID is mostly random?
>
> So couldn't you break your UUID into separate fields, or barring that into
> a jsonb or array field that contains the components, and then just index
> that computed field? Maybe that could even be achieved in a view, if you
> don't want to alter your core table?
>
> Obviously all this could make no sense if I'm not following the purpose
> of your use of tsvector.
> Best,
> Steve
>


-- 
Martin Norbäck Olivers
IT consultant, Masara AB
Phone: +46 703 22 70 12
Email: [email protected]
Kärrhöksvägen 4
656 72 Skattkärr

