public inbox for [email protected]help / color / mirror / Atom feed
Re: Aggregate versions of hashing functions (md5, sha1, etc...) 3+ messages / 2 participants [nested] [flat]
* Re: Aggregate versions of hashing functions (md5, sha1, etc...) @ 2025-07-11 09:00 Dominique Devienne <[email protected]> 0 siblings, 2 replies; 3+ messages in thread From: Dominique Devienne @ 2025-07-11 09:00 UTC (permalink / raw) To: Ron Johnson <[email protected]>; +Cc: pgsql-general On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <[email protected]> wrote: > On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <[email protected]> wrote: >> On 7/10/25 04:48, Dominique Devienne wrote: >> > Seems so logical to me, that these hashing functions were available >> > are aggregates, I can't be the first one to think of that, can it? >> >> I've been on this list since late 2002 and I don't recall this ever >> being brought up. Now it is entirely possible that age has dimmed my >> recall abilities:) Though a quick search seems to confirm my memory. > > What even is an aggregate hash function? (I can imagine a few possibilities, but don't want to assume.) Well, it's so obvious to me, I wonder if you're baiting me :) Any hasher/digest inits some internal state, processes bytes, typically in "streaming-fashion" via successive byte spans (equivalent to PostgreSQL's bytea), and yields a digest of various length at the end. The current md5() and pgcrypto.digest() functions roll the x1 init, xN process, and x1 finish into a single call, processing a single bytea (or perhaps more intelligently for TOAST'ed values, the 2K "rows" of those in streaming-fashion, hopefully. Can a dev confirm?). As an aggregate, the processing is extended to all values aggregated. That's it. Obviously order-sensitive, so an explicit ORDER BY inside the aggregate call is DE RIGEUR, but that's normal. As I mentioned already, SQLite supports sha3_agg() for almost a year, and had sha(1|3)_query, which is conceptually similar (although hashes value types too, since multi-column and dynamically typed), for years (8+ for sha3, probably decades for sha1). Basically anyone who knows hashing/digests and has ever written an aggregate UDF (in SQLite or elsewhere), understands what I'm talking about. --DD ^ permalink raw reply [nested|flat] 3+ messages in thread
* Re: Aggregate versions of hashing functions (md5, sha1, etc...) @ 2025-07-11 11:46 Ron Johnson <[email protected]> parent: Dominique Devienne <[email protected]> 1 sibling, 0 replies; 3+ messages in thread From: Ron Johnson @ 2025-07-11 11:46 UTC (permalink / raw) To: pgsql-general On Fri, Jul 11, 2025 at 5:01 AM Dominique Devienne <[email protected]> wrote: > On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <[email protected]> > wrote: > > On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver < > [email protected]> wrote: > >> On 7/10/25 04:48, Dominique Devienne wrote: > >> > Seems so logical to me, that these hashing functions were available > >> > are aggregates, I can't be the first one to think of that, can it? > >> > >> I've been on this list since late 2002 and I don't recall this ever > >> being brought up. Now it is entirely possible that age has dimmed my > >> recall abilities:) Though a quick search seems to confirm my memory. > > > > What even is an aggregate hash function? (I can imagine a few > possibilities, but don't want to assume.) > > Well, it's so obvious to me, I wonder if you're baiting me :) > > Any hasher/digest inits some internal state, processes bytes, > typically in "streaming-fashion" via successive byte spans (equivalent > to PostgreSQL's bytea), and yields a digest of various length at the > end. The current md5() and pgcrypto.digest() functions roll the x1 > init, xN process, and x1 finish into a single call, processing a > single bytea (or perhaps more intelligently for TOAST'ed values, the > 2K "rows" of those in streaming-fashion, hopefully. Can a dev > confirm?). As an aggregate, the processing is extended to all values > aggregated. So it "appends" all the fields into one (virtual) mega-structure and takes the hash on that. It's what I expected but wanted to verify. -- Death to <Redacted>, and butter sauce. Don't boil me, I'm still alive. <Redacted> lobster! ^ permalink raw reply [nested|flat] 3+ messages in thread
* Re: Aggregate versions of hashing functions (md5, sha1, etc...) @ 2025-07-11 12:11 Dominique Devienne <[email protected]> parent: Dominique Devienne <[email protected]> 1 sibling, 0 replies; 3+ messages in thread From: Dominique Devienne @ 2025-07-11 12:11 UTC (permalink / raw) To: Ron Johnson <[email protected]>; +Cc: pgsql-general On Fri, Jul 11, 2025 at 11:00 AM Dominique Devienne <[email protected]> wrote: > The current md5() and pgcrypto.digest() functions roll the x1 > init, xN process, and x1 finish into a single call, processing a > single bytea (or perhaps more intelligently for TOAST'ed values, the > 2K "rows" of those in streaming-fashion, hopefully. Can a dev confirm?) FWIW, I've [asked ChatGPT about that][1], and assuming it's right (md5 and pgcrypto.digest not leveraging the "substring-optimization" on TOASTED bytea), that's an unfortunate lost opportunity, especially for byteas reaching close to the 1GB limit. And again (sorry to lay it on thick...), when required to manually chunk for sizes > 1GB, the lack of aggregate is a bit crippling, I'm afraid. So again, can a dev confirm what ChatGPT blurted out? And if true, any interest in improving that for better TOAST support for true streaming hashing for current scalar digests? And of course, the main point of this thread, add (true streaming) aggregate support in a future version? Thanks, --DD [1]: https://chatgpt.com/share/6870fe03-416c-800e-8633-a76e478a794a ^ permalink raw reply [nested|flat] 3+ messages in thread
end of thread, other threads:[~2025-07-11 12:11 UTC | newest] Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2025-07-11 09:00 Re: Aggregate versions of hashing functions (md5, sha1, etc...) Dominique Devienne <[email protected]> 2025-07-11 11:46 ` Ron Johnson <[email protected]> 2025-07-11 12:11 ` Dominique Devienne <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox