Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qwLKe-00BDBp-2M for pgsql-sql@arkaria.postgresql.org; Fri, 27 Oct 2023 11:48:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1qwLKc-00Eta5-7T for pgsql-sql@arkaria.postgresql.org; Fri, 27 Oct 2023 11:48:50 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1qwLKb-00EtZx-Na for pgsql-sql@lists.postgresql.org; Fri, 27 Oct 2023 11:48:49 +0000 Received: from mail-pf1-x42d.google.com ([2607:f8b0:4864:20::42d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1qwLKY-002Z5U-JC for pgsql-sql@lists.postgresql.org; Fri, 27 Oct 2023 11:48:48 +0000 Received: by mail-pf1-x42d.google.com with SMTP id d2e1a72fcca58-6b1e46ca282so1934419b3a.2 for ; Fri, 27 Oct 2023 04:48:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail-com.20230601.gappssmtp.com; s=20230601; t=1698407325; x=1699012125; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=BXiYWAg7UfBC0KNSxRBLz9VVbcr8nryN8t19pqpYFBc=; b=AZqdH3ESj4nLL0a7dIrsZnuUNOVD6LcZkMeHWKG/JuFWvtlWefPHMcGroD26d8Aazn zInOZsBMtbJ12xfUeKGa90wxNLozyPYhN2l0H+nmzZ//6gtVcs1HNXZqVFfoTiHYqaQq Tc6XyG9qlXP2LX4DRyXRlyIvEQQS87wZvffvcxhscuRgOEE05cLo8p2XupdJo/HH9r9o YDX82ZwkBXowQGgdwvbdBSmPb1EyMOqXeBxETe7AP80H1t+TUFGhBAfM1hDjMcnwViep vs8RBtczRxxCTPMv1iz7NR0ItsLQqqHhPtZlh+Xs8ABwVmkF+nUDDemN76tgsgvPMxUK c1Uw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698407325; x=1699012125; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=BXiYWAg7UfBC0KNSxRBLz9VVbcr8nryN8t19pqpYFBc=; b=weRpb8yxc1c8k9J5OpsQoxz83MM+u+vNy1FH+rzSWuI/yNQwfT2H87qGdbo2t9tU94 3MsuWSmCKEAsYLVSZdU+cOf9CwOjqXnpJrG2M6frDskeQ+TptukNJYBLYPemwgphH0rv +IDl1em5GjwHF4AvwfkB5G6O7LVf+DMa883CUnuFwATBZo7cZmVt7Fq2Tcsl0B4bAcCg sBAdkZFSe3UTFL0FikYhG6RqiVk5J4Fk2E374PiqvQkbddamr89tqJ1oNUflqL2Ae3wS Va5p379U7JJNQQKK4tMctE5BRvVF/i+wHbBqkVQmP07Qg3F/Hla135pZ109mPPhnd4Xk K13w== X-Gm-Message-State: AOJu0YxlkERWR0CdjaWgcd70M0lArkI/ft4N8iD0N278+xExzPc14DSy lkR8q3Qg43ZDpDTSdZ30J9nZC56fYkW0u6r51ZXkJtyf8sw= X-Google-Smtp-Source: AGHT+IHLNbGuVjrYj0R3oZDbC4LKrs4VVBBpRcsg56Uy+xT70rEh2OoVBrVY2fFhGSW+xEDEJH/LxP9MeZXYh1dzXnA= X-Received: by 2002:a05:6a20:3956:b0:154:e887:f581 with SMTP id r22-20020a056a20395600b00154e887f581mr3087595pzg.58.1698407325060; Fri, 27 Oct 2023 04:48:45 -0700 (PDT) MIME-Version: 1.0 From: =?UTF-8?Q?Martin_Norb=C3=A4ck_Olivers?= Date: Fri, 27 Oct 2023 13:48:32 +0200 Message-ID: Subject: full text search and hyphens in uuid To: pgsql-sql@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000004b53f00608b14558" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000004b53f00608b14558 Content-Type: text/plain; charset="UTF-8" Hi! I have a problem with full text search and uuids in the text which I index using to_tsvector . I have uuids in my text and most of the time, it works well because they are lexed as words so I can just search for the parts of the uuid. The problem is an uuid like this: select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361') Which gives this result '-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4 So, I found dict_int and asked it to remove the minus signs create extension dict_int; ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true); alter text search configuration simple alter mapping for int, uint with intdict and now I get this result instead: '0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5 'eb-abcd-47aa57184361':4 which is slightly better, but still not good enough because there is no token 59eb. It's being split into 59 and eb. Is there any way to change this behaviour of the tsvector lexer? Do I have to write my own tsvector or is there a way to "turn off" integer handling in the lexer? Regards, Martin --0000000000004b53f00608b14558 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi!
I have a problem with full text search and uuids in the te= xt which I=C2=A0index using to_tsvector . I have uuids in my text and most = of the=C2=A0time, it works well because they are lexed as words so I can ju= st search for the parts of the uuid.

The problem is an uuid like thi= s:
select to_tsvector('simple','0232710f-8545-59eb-abcd-47a= a57184361')

Which gives this result
<= div>
'-59':3 '-8545':2 '0232710f':1 '47aa57= 184361':7 'abcd':6 'eb':5 'eb-abcd-47aa57184361'= ;:4

So, I found dict_int and asked it to rem= ove the minus signs

create extension dict_int= ;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN =3D 12, absval =3D= true);
alter text search configuration simple alter mapping for = int, uint with intdict

=C2=A0and now I get t= his result instead:
'0232710f':1 '47aa57184361':7 &= #39;59':3 '8545':2 'abcd':6 'eb':5 'eb-abcd= -47aa57184361':4

which is slightly better, but= still not good enough because there is no token 59eb. It's being split= into 59 and eb.

Is there any way to change this behaviour of the ts= vector lexer? Do I have to write my own tsvector or is there a way to "= ;turn off" integer handling in the lexer?

Regards,
Martin
--0000000000004b53f00608b14558--