Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uZwY9-00B0Og-NI for pgsql-general@arkaria.postgresql.org; Thu, 10 Jul 2025 19:03:17 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uZwY7-00GGXr-Pk for pgsql-general@arkaria.postgresql.org; Thu, 10 Jul 2025 19:03:16 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uZwY7-00GGXj-E8 for pgsql-general@lists.postgresql.org; Thu, 10 Jul 2025 19:03:16 +0000 Received: from mail-lf1-x12e.google.com ([2a00:1450:4864:20::12e]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uZwY6-006tJh-02 for pgsql-general@lists.postgresql.org; Thu, 10 Jul 2025 19:03:15 +0000 Received: by mail-lf1-x12e.google.com with SMTP id 2adb3069b0e04-553d52cb80dso1332873e87.1 for ; Thu, 10 Jul 2025 12:03:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1752174193; x=1752778993; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=mvdqXSavfC+nPRYtHCgojGq0xkwbHcbZfKpuaYd3Z08=; b=GyL9Z2pTQ1m9OGJQofQ1Mn8SaxjaBW2CkdmDcaONlRDfW+XuTc+6bHW6j/kLhNNaYI mVUQMIikLdfSTLjGINahbYm2Bnvf3DUU8HcCYw5F2Fwas9VsmIkTGDvzLJ95Wh3lEcIy mpCcExRZ54fSc55+gKGYQkDq+yhHgGeK1B4CuECKt5gweLAs0+bZuyDpm7PEDKYJVvTE RXJTXFps9HCOAz9tv84k2ztaYHXK5Zvq4DLrxAo9uuXtlMJzHmYffwnCOSwns+pZAD5T ORLCVJUod40CrjOQbx/t7BZC6PzrnJP0Ef6BFXJFNooCsqR+lVfkH8KFNyGVJH2X6Rbg xoyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1752174193; x=1752778993; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=mvdqXSavfC+nPRYtHCgojGq0xkwbHcbZfKpuaYd3Z08=; b=rDmtDM7fzxSKN5MsIBRSlwb3Hk5htqg9agg+DHPyZothStKXtaAgSN7YZYlt8sLagJ Ym38cAG/xUTpm8oev+vOdRD87kxxqKh7GuQ5EYDOtp57H/NhCvIYX8+bbT2N1M/1PLXU NxP1epPwy/Pmf2UmRsGq8I0+KWTKVFoPShV2OAUmIrYWKHd3LxcDfEKZpaerxEMlTm3M a10FGN4gB2f1EyqlFnPsyWbtsTBV5v8Tq6UdoNU4eQfP2Kipm0VlHMLlwU19F8FOuL+i pozE7IYqFnp5NBc2CZP4nqyBD1wkzbkASFebLRkmoGAyialM3tWDHf0dAmOQp0IOc0fi y/vg== X-Gm-Message-State: AOJu0Yx5DhCfhGtBM9GaExXMVNayxPIU9SKRGYWiiGj/MZPfiC0aZsdX OrJmlmSPOIJvIaEXip3/6x0yWzinf6r5fnWj473z7bZF9DTMT/OKXY4AKNZMMJNA71DKA7Wd9Kt v41qdy7JgLzNWLKJrDjqc6+ygeI2ePF4TeqNA X-Gm-Gg: ASbGncs7/CditgnYzKfX4OL0J+Bu6hBuq/HRzPsq23Qwh+jBhDujJy/lNFxx2/vCZ6v DeDAlUeB3A9q49GM2UFwXK2nRLBJFpnHlrGc0qlsCune+iJqDHvZlji7/JKduOlhlkwP5DljEeq WWR8LARD0jfpci6MNxpyYlb3x4X/t0oyC6ASUNMrsvQHg= X-Google-Smtp-Source: AGHT+IFvf/JTqJLHcPWlfXbnYQsDPzGP/UrOffrwIK3WgGijC7tS3RoeczZCBVT0/pA34qQSSp8Q0NziJEkQnjlIicc= X-Received: by 2002:a05:6512:1101:b0:558:faa8:f667 with SMTP id 2adb3069b0e04-559006a21b7mr1592318e87.5.1752174192551; Thu, 10 Jul 2025 12:03:12 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Merlin Moncure Date: Thu, 10 Jul 2025 13:02:59 -0600 X-Gm-Features: Ac12FXzOXnRYYzOY7Zi3EWSs6XHP-2IsYi1myGDGC9pNL-xnQhNe7EqWUINMVaI Message-ID: Subject: Re: Aggregate versions of hashing functions (md5, sha1, etc...) To: Dominique Devienne Cc: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="00000000000054ebaa063997d883" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000054ebaa063997d883 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Jul 10, 2025 at 5:48=E2=80=AFAM Dominique Devienne wrote: > We store scientific information in PostgreSQL, and some of that is > bytea and large, thus we must "chunk it" both for performance, and not > be limited to 1GB (we do exceed that, in rare occasions). > > Recently I added md5/sha1 hashing support for such values (for various > reasons, to track corruptions in our ETLs, now fixed, but also in the > future for custom smart sync and such), and was surprised to discover > there are no aggregate versions of those functions, neither the > built-in md5 one (now with bytea overload), nor for pgcrypto. Did I > miss something? > > Any chance this might be added in the future? > > Seems so logical to me, that these hashing functions were available > are aggregates, I can't be the first one to think of that, can it? > > Thanks, --DD > > PS: The holly-grail IMHO, would be built-in support for hashing, with > intelligent lazy-compute and persistence correctly reset on changes. > Probably opt-in. > > PPS: Built-in Support for xxh64, or at least in official pgcrypto, > would also be nice. With aggregate! > > PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type > attribute on them, which was used as the content-type with served over > the built-in WebDAV extension for the DB. I wish PostgreSQL had > something like that too. > > PPPPS: the lo extension is not viable for us, FWIW. All the above > should be opt-in on bytea columns IMHO. > Does it have to match the md5 of the 'unchunked' variant exactly? If not, maybe you can rig a custom aggregate that would just "hash amend" the chunks using the n-1 chunk has as salt, this would be fast and easy, at the cost of not matching the unchunked variant. I had to do something very similar with compression, I had a need to be able to compress bytea values with lz4 at SQL level due to limited support for extensions in the cloud. it works great...if a bit slow, and requires the database to handle the decompression. merlin --00000000000054ebaa063997d883 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Thu, Jul 10, 2025 at 5:48=E2=80=AFAM D= ominique Devienne <ddevienne@gmai= l.com> wrote:
=
We store scientific infor= mation in PostgreSQL, and some of that is
bytea and large, thus we must "chunk it" both for performance, an= d not
be limited to 1GB (we do exceed that, in rare occasions).
<= div>=C2=A0
Recently I added md5/sha1 hashing support for such values (for various
reasons, to track corruptions in our ETLs, now fixed, but also in the
future for custom smart sync and such), and was surprised to discover
there are no aggregate versions of those functions, neither the
built-in md5 one (now with bytea overload), nor for pgcrypto. Did I
miss something?

Any chance this might be added in the future?

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

Thanks, --DD

PS: The holly-grail IMHO, would be built-in support for hashing, with
intelligent lazy-compute and persistence correctly reset on changes.
Probably opt-in.

PPS: Built-in Support for xxh64, or at least in official pgcrypto,
would also be nice. With aggregate!

PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type
attribute on them, which was used as the content-type with served over
the built-in WebDAV extension for the DB. I wish PostgreSQL had
something like that too.

PPPPS: the lo extension is not viable for us, FWIW. All the above
should be opt-in on bytea columns IMHO.

Does it have to match the md5 of the 'unchunked' variant exactly? = If not, maybe you can rig a custom aggregate that would just "hash ame= nd" the chunks using the n-1 chunk has as salt, this would be fast and= easy, at the cost of not matching the unchunked variant.

I had to do something very similar with compression, I had a need t= o be able to compress bytea values with lz4 at SQL level due to limited sup= port for extensions in the cloud.=C2=A0 it works great...if a bit slow, and= requires the database to handle the decompression.

merlin
=C2=A0
--00000000000054ebaa063997d883--