public inbox for [email protected]
help / color / mirror / Atom feedRe: Can we get sha* function over text, that could be used in index?
2+ messages / 2 participants
[nested] [flat]
* Re: Can we get sha* function over text, that could be used in index?
@ 2026-02-18 18:02 Linus Heckemann <[email protected]>
2026-02-20 21:40 ` Re: Can we get sha* function over text, that could be used in index? Daniel Verite <[email protected]>
0 siblings, 1 reply; 2+ messages in thread
From: Linus Heckemann @ 2026-02-18 18:02 UTC (permalink / raw)
To: pgsql-general
Hi, I'm the one who asked :)
Ron Johnson wrote:
> On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz(at)depesz(dot)com>
> wrote:
>
> > Hi,
> > So, we have various sha* functions.
> >
> > And I recently got asked about using them as a based for unique index
> > over long texts.
> > Normally one would do it with md5(text), but the person asking wanted to
> > use sha().
>
> I think I'd push back, asking them if they really need
> cryptographically-secure hashing (which they most probably don't).
I indeed don't need cryptographically-secure hashing in this
scenario, and I will be using md5.
Nevertheless, I think using md5 in any context that isn't "to verify
digests produced in the past" is a poor decision, and postgres should
make using other hashes just as easy!
The specific context of hashing aside, it seems weird to me that:
- there is a byte-array representation of text columns, which appears to
be independent of database encoding
- there doesn't seem to be _any_ sane way to access this.
The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!
The fact that convert_to() is only stable and not immutable makes sense
to me given the effect that configuration can have on its behaviour, but
given that there does appear to be a trivially-accessible UTF-8
representation (as used by md5()) I think there should be an immutable
function that provides access to it? Is there a good reason not to? I'd
be willing to send a patch for it myself.
Linus
^ permalink raw reply [nested|flat] 2+ messages in thread
* Re: Can we get sha* function over text, that could be used in index?
2026-02-18 18:02 Re: Can we get sha* function over text, that could be used in index? Linus Heckemann <[email protected]>
@ 2026-02-20 21:40 ` Daniel Verite <[email protected]>
0 siblings, 0 replies; 2+ messages in thread
From: Daniel Verite @ 2026-02-20 21:40 UTC (permalink / raw)
To: Linus Heckemann <[email protected]>; +Cc: pgsql-general
Linus Heckemann wrote:
> - there is a byte-array representation of text columns, which appears to
> be independent of database encoding
Not sure what you're refering to. Both the on-disk and in-memory
representations of text/varchar are encoding-dependent.
> The obvious (to a naive user, like I was) approach, casting to bytea,
> has exceptionally surprising behaviour: for many text strings, it does
> exactly what the naive user might hope for, giving back the UTF-8
> representation. But multiple distinct text strings, like '\033' and
> '\x1b', convert to the same byte string! And text strings containing a
> backslash that doesn't fit the bytea hex format or the bytea escape
> format will fail to convert completely!
Yes. It seems a common mistake to forget or ignore that
backslashes are special in the input text representation of bytea.
It might be not obvious from reading the doc at [1]
but we just need to quote backslashes by doubling them.
AFAIK a working solution for the OP would be:
sha256(replace(colname, '\', '\\')::bytea)
The result is encoding-dependent, but that does not matter
in the context of an expression. index.
If the database ever needs to change its encoding, it will have to
be recreated entirely anyway.
[1]
https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
^ permalink raw reply [nested|flat] 2+ messages in thread
end of thread, other threads:[~2026-02-20 21:40 UTC | newest]
Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-18 18:02 Re: Can we get sha* function over text, that could be used in index? Linus Heckemann <[email protected]>
2026-02-20 21:40 ` Daniel Verite <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox