public inbox for [email protected]  
help / color / mirror / Atom feed
From: Linus Heckemann <[email protected]>
To: pgsql-general mailing list <[email protected]>
Subject: Re: Can we get sha* function over text, that could be used in index?
Date: Wed, 18 Feb 2026 19:02:10 +0100
Message-ID: <ygawm0a9bh9.fsf@localhost> (raw)

Hi, I'm the one who asked :)

Ron Johnson wrote:
> On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz(at)depesz(dot)com>
> wrote:
> 
> > Hi,
> > So, we have various sha* functions.
> >
> > And I recently got asked about using them as a based for unique index
> > over long texts.
> > Normally one would do it with md5(text), but the person asking wanted to
> > use sha().
> 
> I think I'd push back, asking them if they really need
> cryptographically-secure hashing (which they most probably don't).

I indeed don't need cryptographically-secure hashing in this
scenario, and I will be using md5.

Nevertheless, I think using md5 in any context that isn't "to verify
digests produced in the past" is a poor decision, and postgres should
make using other hashes just as easy!

The specific context of hashing aside, it seems weird to me that:

- there is a byte-array representation of text columns, which appears to
  be independent of database encoding
- there doesn't seem to be _any_ sane way to access this.

The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!

The fact that convert_to() is only stable and not immutable makes sense
to me given the effect that configuration can have on its behaviour, but
given that there does appear to be a trivially-accessible UTF-8
representation (as used by md5()) I think there should be an immutable
function that provides access to it? Is there a good reason not to? I'd
be willing to send a patch for it myself.

Linus






view thread (2+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected]
  Subject: Re: Can we get sha* function over text, that could be used in index?
  In-Reply-To: <ygawm0a9bh9.fsf@localhost>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox