public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Aggregate versions of hashing functions (md5, sha1, etc...)
3+ messages / 2 participants
[nested] [flat]

* Re: Aggregate versions of hashing functions (md5, sha1, etc...)
@ 2025-07-11 09:00  Dominique Devienne <[email protected]>
  0 siblings, 2 replies; 3+ messages in thread

From: Dominique Devienne @ 2025-07-11 09:00 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-general

On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <[email protected]> wrote:
> On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <[email protected]> wrote:
>> On 7/10/25 04:48, Dominique Devienne wrote:
>> > Seems so logical to me, that these hashing functions were available
>> > are aggregates, I can't be the first one to think of that, can it?
>>
>> I've been on this list since late 2002 and I don't recall this ever
>> being brought up. Now it is entirely possible that age has dimmed my
>> recall abilities:) Though a quick search seems to confirm my memory.
>
> What even is an aggregate hash function?  (I can imagine a few possibilities, but don't want to assume.)

Well, it's so obvious to me, I wonder if you're baiting me :)

Any hasher/digest inits some internal state, processes bytes,
typically in "streaming-fashion" via successive byte spans (equivalent
to PostgreSQL's bytea), and yields a digest of various length at the
end. The current md5() and pgcrypto.digest() functions roll the x1
init, xN process, and x1 finish into a single call, processing a
single bytea (or perhaps more intelligently for TOAST'ed values, the
2K "rows" of those in streaming-fashion, hopefully. Can a dev
confirm?). As an aggregate, the processing is extended to all values
aggregated. That's it. Obviously order-sensitive, so an explicit ORDER
BY inside the aggregate call is DE RIGEUR, but that's normal. As I
mentioned already, SQLite supports sha3_agg() for almost a year, and
had sha(1|3)_query, which is conceptually similar (although hashes
value types too, since multi-column and dynamically typed), for years
(8+ for sha3, probably decades for sha1).

Basically anyone who knows hashing/digests and has ever written an
aggregate UDF (in SQLite or elsewhere), understands what I'm talking
about. --DD






^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Aggregate versions of hashing functions (md5, sha1, etc...)
@ 2025-07-11 11:46  Ron Johnson <[email protected]>
  parent: Dominique Devienne <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Ron Johnson @ 2025-07-11 11:46 UTC (permalink / raw)
  To: pgsql-general

On Fri, Jul 11, 2025 at 5:01 AM Dominique Devienne <[email protected]>
wrote:

> On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <[email protected]>
> wrote:
> > On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <
> [email protected]> wrote:
> >> On 7/10/25 04:48, Dominique Devienne wrote:
> >> > Seems so logical to me, that these hashing functions were available
> >> > are aggregates, I can't be the first one to think of that, can it?
> >>
> >> I've been on this list since late 2002 and I don't recall this ever
> >> being brought up. Now it is entirely possible that age has dimmed my
> >> recall abilities:) Though a quick search seems to confirm my memory.
> >
> > What even is an aggregate hash function?  (I can imagine a few
> possibilities, but don't want to assume.)
>
> Well, it's so obvious to me, I wonder if you're baiting me :)
>
> Any hasher/digest inits some internal state, processes bytes,
> typically in "streaming-fashion" via successive byte spans (equivalent
> to PostgreSQL's bytea), and yields a digest of various length at the
> end. The current md5() and pgcrypto.digest() functions roll the x1
> init, xN process, and x1 finish into a single call, processing a
> single bytea (or perhaps more intelligently for TOAST'ed values, the
> 2K "rows" of those in streaming-fashion, hopefully. Can a dev
> confirm?). As an aggregate, the processing is extended to all values
> aggregated.


So it "appends" all the fields into one (virtual) mega-structure and takes
the hash on that.

It's what I expected but wanted to verify.

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Aggregate versions of hashing functions (md5, sha1, etc...)
@ 2025-07-11 12:11  Dominique Devienne <[email protected]>
  parent: Dominique Devienne <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Dominique Devienne @ 2025-07-11 12:11 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-general

On Fri, Jul 11, 2025 at 11:00 AM Dominique Devienne <[email protected]> wrote:
> The current md5() and pgcrypto.digest() functions roll the x1
> init, xN process, and x1 finish into a single call, processing a
> single bytea (or perhaps more intelligently for TOAST'ed values, the
> 2K "rows" of those in streaming-fashion, hopefully. Can a dev confirm?)

FWIW, I've [asked ChatGPT about that][1], and assuming it's right (md5
and pgcrypto.digest not leveraging the "substring-optimization" on
TOASTED bytea), that's an unfortunate lost opportunity, especially for
byteas reaching close to the 1GB limit. And again (sorry to lay it on
thick...), when required to manually chunk for sizes > 1GB, the lack
of aggregate is a bit crippling, I'm afraid.

So again, can a dev confirm what ChatGPT blurted out?

And if true, any interest in improving that for better TOAST support
for true streaming hashing for current scalar digests?

And of course, the main point of this thread, add (true streaming)
aggregate support in a future version?

Thanks, --DD

[1]: https://chatgpt.com/share/6870fe03-416c-800e-8633-a76e478a794a






^ permalink  raw  reply  [nested|flat] 3+ messages in thread


end of thread, other threads:[~2025-07-11 12:11 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-07-11 09:00 Re: Aggregate versions of hashing functions (md5, sha1, etc...) Dominique Devienne <[email protected]>
2025-07-11 11:46 ` Ron Johnson <[email protected]>
2025-07-11 12:11 ` Dominique Devienne <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox