public inbox for [email protected]  
help / color / mirror / Atom feed
From: Linus Heckemann <[email protected]>
To: Daniel Verite <[email protected]>
Cc: pgsql-general mailing list <[email protected]>
Subject: Re: Can we get sha* function over text, that could be used in index?
Date: Sat, 21 Feb 2026 10:54:53 +0100
Message-ID: <ygaseaujuaa.fsf@localhost> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>

"Daniel Verite" <[email protected]> writes:
>> - there is a byte-array representation of text columns, which appears to
>>  be independent of database encoding
>
> Not sure what you're refering to. Both the on-disk and in-memory
> representations of text/varchar are encoding-dependent.

Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ΓΏ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!


>> The obvious (to a naive user, like I was) approach, casting to bytea,
>> has exceptionally surprising behaviour: for many text strings, it does
>> exactly what the naive user might hope for, giving back the UTF-8
>> representation. But multiple distinct text strings, like '\033' and
>> '\x1b', convert to the same byte string! And text strings containing a
>> backslash that doesn't fit the bytea hex format or the bytea escape
>> format will fail to convert completely!
>
> Yes. It seems a common mistake to forget or ignore that
> backslashes are special in the input text representation of bytea.
> It might be not obvious from reading the doc at [1]
> but we just need to quote backslashes by doubling them.
>
> AFAIK a working solution for the OP would be:
>  sha256(replace(colname, '\', '\\')::bytea)

That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.

Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's

> guaranteed to return the same results given the same arguments forever

only valid within the context of a single database?


md5() and ::bytea aside, it seems like it would be good to have both

 - a function for accessing the bytea representation of a text value
   without any extra steps, and

 - an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
   because it's the only encoding with a character set encompassing all
   other supported character sets) bytea representation of a text value
   regardless of the database encoding?

Cheers
Linus






reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Can we get sha* function over text, that could be used in index?
  In-Reply-To: <ygaseaujuaa.fsf@localhost>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox