Re: full text search and hyphens in uuid

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Martin Norbäck Olivers <[email protected]>
To: Steve Midgley <[email protected]>
Cc: [email protected]
Subject: Re: full text search and hyphens in uuid
Date: Sun, 29 Oct 2023 17:31:07 +0100
Message-ID: <CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com> (raw)
In-Reply-To: <CAJexoSK=OyX3Phmckkb4-X+KxSjXVrpNLVgDa4BdH+EHqOEgAg@mail.gmail.com>
References: <CALoTC6s=QAvj=yw2cY=8t_dyQsByXF_AT8k=z-YXOcgcj3sO=g@mail.gmail.com>
	<CAJexoSK=OyX3Phmckkb4-X+KxSjXVrpNLVgDa4BdH+EHqOEgAg@mail.gmail.com>

Hi! Thanks for answering.

My use case for doing this is that I have uuids embedded within the text
data (it's JSON data actually) and I just index to_tsvector('simple',
json_column).

And I want to search for the uuids sometimes, and it's not predetermined
which json keys contain them. But it does seem like it's not possible to
change the to_tsvector lexer, so I guess I will have to extract the uuids
when inserting the data and index them separately.

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <[email protected]> wrote:

> On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <[email protected]>
> wrote:
>
>> Hi!
>> I have a problem with full text search and uuids in the text which
>> I index using to_tsvector . I have uuids in my text and most of the time,
>> it works well because they are lexed as words so I can just search for the
>> parts of the uuid.
>>
>> The problem is an uuid like this:
>> select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361')
>>
>> Which gives this result
>> '-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> So, I found dict_int and asked it to remove the minus signs
>>
>> create extension dict_int;
>> ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
>> alter text search configuration simple alter mapping for int, uint with
>> intdict
>>
>>  and now I get this result instead:
>> '0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> which is slightly better, but still not good enough because there is no
>> token 59eb. It's being split into 59 and eb.
>>
>> Is there any way to change this behaviour of the tsvector lexer? Do I
>> have to write my own tsvector or is there a way to "turn off" integer
>> handling in the lexer?
>>
>> Regards,
>> Martin
>>
>> I don't understand your use case for doing this, but it seems like you
> could use something other than ts_vector to break apart your uuids, and
> then index them? It seems like ts_vector is primarily used to find things
> that are near to other things via their vector signatures (at least that's
> my understanding). But doing vector component math on segments of a UUID
> seems meaningless since the UUID is mostly random?
>
> So couldn't you break your UUID into separate fields, or barring that into
> a jsonb or array field that contains the components, and then just index
> that computed field? Maybe that could even be achieved in a view, if you
> don't want to alter your core table?
>
> Obviously all this could be insensible, if I'm not following the purpose
> of your use of ts_vector..
> Best,
> Steve
>


-- 
Martin Norbäck Olivers
IT-konsult, Masara AB
Telefon: +46 703 22 70 12
E-post: [email protected]
Kärrhöksvägen 4
656 72 Skattkärr

view thread (3+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: full text search and hyphens in uuid
  In-Reply-To: <CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox