public inbox for [email protected]  
help / color / mirror / Atom feed
From: Merlin Moncure <[email protected]>
To: Dominique Devienne <[email protected]>
Cc: [email protected]
Subject: Re: Aggregate versions of hashing functions (md5, sha1, etc...)
Date: Thu, 10 Jul 2025 13:02:59 -0600
Message-ID: <CAHyXU0ztigAau2x5zut6tdjNVXENEGR00F-Mvx4X-20zFz-NnQ@mail.gmail.com> (raw)
In-Reply-To: <CAFCRh-8FOZiycyfX4uPB8MTHQTxqNVuW0pdKBuFNQneEZy1PwQ@mail.gmail.com>
References: <CAFCRh-8FOZiycyfX4uPB8MTHQTxqNVuW0pdKBuFNQneEZy1PwQ@mail.gmail.com>

On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <[email protected]>
wrote:

> We store scientific information in PostgreSQL, and some of that is
> bytea and large, thus we must "chunk it" both for performance, and not
> be limited to 1GB (we do exceed that, in rare occasions).
>


> Recently I added md5/sha1 hashing support for such values (for various
> reasons, to track corruptions in our ETLs, now fixed, but also in the
> future for custom smart sync and such), and was surprised to discover
> there are no aggregate versions of those functions, neither the
> built-in md5 one (now with bytea overload), nor for pgcrypto. Did I
> miss something?
>
> Any chance this might be added in the future?
>
> Seems so logical to me, that these hashing functions were available
> are aggregates, I can't be the first one to think of that, can it?
>
> Thanks, --DD
>
> PS: The holly-grail IMHO, would be built-in support for hashing, with
> intelligent lazy-compute and persistence correctly reset on changes.
> Probably opt-in.
>
> PPS: Built-in Support for xxh64, or at least in official pgcrypto,
> would also be nice. With aggregate!
>
> PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type
> attribute on them, which was used as the content-type with served over
> the built-in WebDAV extension for the DB. I wish PostgreSQL had
> something like that too.
>
> PPPPS: the lo extension is not viable for us, FWIW. All the above
> should be opt-in on bytea columns IMHO.
>

Does it have to match the md5 of the 'unchunked' variant exactly? If not,
maybe you can rig a custom aggregate that would just "hash amend" the
chunks using the n-1 chunk has as salt, this would be fast and easy, at the
cost of not matching the unchunked variant.

I had to do something very similar with compression, I had a need to be
able to compress bytea values with lz4 at SQL level due to limited support
for extensions in the cloud.  it works great...if a bit slow, and requires
the database to handle the decompression.

merlin


view thread (2+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Aggregate versions of hashing functions (md5, sha1, etc...)
  In-Reply-To: <CAHyXU0ztigAau2x5zut6tdjNVXENEGR00F-Mvx4X-20zFz-NnQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox