Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ua9mo-00H9Ya-Jv for pgsql-general@arkaria.postgresql.org; Fri, 11 Jul 2025 09:11:18 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ua9mm-0034Xe-NQ for pgsql-general@arkaria.postgresql.org; Fri, 11 Jul 2025 09:11:17 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ua9mm-0034XW-CO for pgsql-general@lists.postgresql.org; Fri, 11 Jul 2025 09:11:17 +0000 Received: from mail-oi1-x22a.google.com ([2607:f8b0:4864:20::22a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1ua9mi-006ze3-2N for pgsql-general@lists.postgresql.org; Fri, 11 Jul 2025 09:11:14 +0000 Received: by mail-oi1-x22a.google.com with SMTP id 5614622812f47-4079f80ff0fso1807373b6e.1 for ; Fri, 11 Jul 2025 02:11:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1752225070; x=1752829870; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=aBOeQNoLbiId6M7zvBnOnevcqT14wlLECZVx85sVJa8=; b=Mnzf9OD3N0N1P/ZapnUdyqtHZ47QLXA1QIFJhyIBvmYAdRxMBYj5a3NPDkmFQfqpho 6bKOktNMnjvx4NNF0FEtRmlnVaA1AzpxlPX4w0AevS0yeZr2Uk+teoNp2qnNL9Z2cITD u47F9EISFI/+FmX/2OD7+AoJNSjZ1M+kWXyv9SNlrKX/ajotYMg4XJcsB2xIguIaHDY2 cVK0rAIjB96oDXUmWCPhZiN5e6A2/MWltgxR1TqbOuUe4anplTz/ZSCfWFz0z9k2A75f GI43cX/Woh3pUk+ajcRhS3g8sQ4LRcWnZsFVL3yNWJR1RxvUQ+O2Z7aAqBfOS+Z/mxlK H3Nw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1752225070; x=1752829870; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aBOeQNoLbiId6M7zvBnOnevcqT14wlLECZVx85sVJa8=; b=Nk2ChgrdRAc2yzwpDLeYh4I1UZSE8GqncWXwgAxz/RwQm/FIKMYVCXk9GFRqg3KhBd WKylWOF8xlroAKJISjtcfu9yU+KDn44bsAHd7+ghxgX9B8YPVJr/LCnSyeTCBv02D/nj yPgNEnH2cU6leD8cP31CIOL8WB9PseIm9xxnTj4kvIja5YEhHQCZT09vUzIZCjIkU7Bk O6BvbxK9VNTyvaD+tWBolqyfb86EJVZfqgq37Z4Vo8l2gh2WvLOpfxceNvgj0U6LZCJm LVkmZ9VKKHjwWx9fmMMfqHKoW4laKhKVA0HbrBa21K5Td724GY74cYEj6NP0jc3Y8MD6 S6gg== X-Gm-Message-State: AOJu0Yw3GgziOXb/E2WXAKPxg3wt6W4zgXpKciPbPPfwuzCVFAiAsAZd daXQpugBk6E5+/08FyDTGCzLt/murXbEMnN15Uc2f+gn2xwgsEbZPbzNWp5qDtmUzkHxqCGZjM1 x98pmCebKUK3+8f+6jI8AIsr8ZCmAryBm9HsD X-Gm-Gg: ASbGnctmAJXu5CC1O/Nz3U3vQAr7mAviN47eecriZMNpec5kL+oczP6HA1/0YOBB5yV AsEgScfHu9EHT4ubEvhMPwzv/NH1dwi3Z/YNsCB7gvVI9DEIh7qYEKqqDAJICN+flfP6FvydhyX CS5UemyP/xGHOT4DkdKxDsdnQvaAGrXSivA+KGl3pAV3Uq2lSJ3kyLDDC3oM8oRuISCTE6KJVz6 8nPSlImSQ== X-Google-Smtp-Source: AGHT+IH2JQV9busIc3zbHiOmup2Yse41CAEKlJ0SOXZKjhh17TOWeufAl6KRiNF3DFzZqcTHQbKpM/Es1wl8r8jdzkM= X-Received: by 2002:a05:6808:220f:b0:3f8:bbf3:3a18 with SMTP id 5614622812f47-413c4890f1cmr4790026b6e.16.1752225070207; Fri, 11 Jul 2025 02:11:10 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Dominique Devienne Date: Fri, 11 Jul 2025 11:10:58 +0200 X-Gm-Features: Ac12FXzvRlfPmLoLxhYy2SnwJOt3PzVgEStbTLRWafdDpXdGg95zWEH1yOPlz68 Message-ID: Subject: Re: Aggregate versions of hashing functions (md5, sha1, etc...) To: Merlin Moncure Cc: pgsql-general@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Jul 10, 2025 at 9:03=E2=80=AFPM Merlin Moncure = wrote: > On Thu, Jul 10, 2025 at 5:48=E2=80=AFAM Dominique Devienne wrote: >> We store scientific information in PostgreSQL, and some of that is >> bytea and large, thus we must "chunk it" both for performance, and not >> be limited to 1GB (we do exceed that, in rare occasions). > > Does it have to match the md5 of the 'unchunked' variant exactly? Well, yeah, ideally. > If not, maybe you can rig a custom aggregate that would just "hash amend" Sure, there are various ways to get a deterministic hash of such a chunked-across-rows (logical) value. But it always involves hashing the hash, which statistically might not be ideal. But more importantly, it departs from the normal "serial" hashing of the full logical hash. With the full hash, I can use various FS OS tools and SQLite aggregate-UDFs and TBD PostgreSQL aggregate-Digests to compare those hashes in natural and consistent ways. Working around lack of aggregate digests in PostgreSQL forces to replicate those work-arounds at the FS and SQLite levels, for comparisons. I.e. Not good. > I had to do something very similar with compression, I had a need to be a= ble to compress bytea values with lz4 at SQL level due to limited support f= or extensions in the cloud. it works great...if a bit slow, and requires t= he database to handle the decompression. The slow part is what I want to avoid. I explored various ways to work-around true server-side aggregate hashing, and none are satisfactory nor performant enough. This is really the kind of primitive that must be built-in to be efficient enough to be "useful". (and I'd argue compression, deduplication, etc... are similar issues. BLOBs / BYTEAs are too often "limited" in RDBMSs, with people saying store them outside the DB, and I always find so strange, and a bit of a cope out to making it performant enough and/or full features enough. A bit of a rant here, sorry ;)).