Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rtqEb-000FVY-Nk for pgsql-general@arkaria.postgresql.org; Mon, 08 Apr 2024 14:44:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rtqEa-007mqH-QV for pgsql-general@arkaria.postgresql.org; Mon, 08 Apr 2024 14:44:32 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rtqEa-007mq8-Fu for pgsql-general@lists.postgresql.org; Mon, 08 Apr 2024 14:44:32 +0000 Received: from mail-lj1-x235.google.com ([2a00:1450:4864:20::235]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rtqEX-001q2F-Qe for pgsql-general@lists.postgresql.org; Mon, 08 Apr 2024 14:44:31 +0000 Received: by mail-lj1-x235.google.com with SMTP id 38308e7fff4ca-2d48d75ab70so61686871fa.0 for ; Mon, 08 Apr 2024 07:44:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1712587468; x=1713192268; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=AU2NpQQDJSDjKp3tb59H45AgtvI1haegvONTWpEiD2s=; b=atbJZbzKgs1hUqkem5UNYMW+UYx489Ist0/40fjmhI24D7E16SNHVz1SocH1jj1v0B I/ZfZwe50rx/l7OBcyrEA8BOTCuMtberLXXxJMumhxZ2Cop62PIAVTVehsNJgaTu68Rk 0qVaKYbGDWl0XgQw2RCfrYCbvlGxi2t4791oCdrIXnWsRUM5E0eKTnTgYlg80VvSRcIC UvxaWU+ErdjPbePl6asSy37mTSUnAe+6YUE3njUm0fJ4oSC6TdiDQk4684xDxl4Twk5c mw80azBIaUhVxIOJMDb8QpU/+zwsdMDNnCKRrHkMmy33J6MtRFA01ZrNPSX2L5pvNt9x Gthg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712587468; x=1713192268; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=AU2NpQQDJSDjKp3tb59H45AgtvI1haegvONTWpEiD2s=; b=RYr9q1KpkL6ohiRPwuBDr+KBXpnJkwljOkVg7Hmr7MAIJCyGwIVnaDpfVRRLLO1HSW OxxUhzUVcLHuU2ZRi6dhElVxJkvXIXqK/t6rZu+mCskGy8mfHz70uXrLYxJ42QEwq3X3 NiTLcIxsg0/VB6fUj8XV8h/s5ROKdfO+qPH3PomieFDJ8G6drKqQDI7hfL7jGL9UTkYH o13aOOFgOVormqX8/VAZhDO+it97vRKDxEEjGY5/WBTUpSERRGs45cxIIM4L0d55tWAB 3VIbVhXP4YmJtP/lZffU7jeV87U4xgOEhnF7nlFVPxoJZECvGXOBignpGRAHFr2eZvoy Snzg== X-Gm-Message-State: AOJu0YwxVcM1cLdSeK+OoUteJvOw6Xk4SgK50V5WEyv2OIERccEu0CM7 vq6IQdbFR90S5IguGmPBhYCDDaoEzXoClGgHV6Ua6RFJ+BLI0sR+T+w0hUyIQfkrg8r5QJAOiYc XYSMqm4Io2FD/LyM2z1ryM/cshLY= X-Google-Smtp-Source: AGHT+IE1Xus6w07JcKqyFDa7CZgixpKP4ockLqAm36rHr6ixmPocj0QRm9CQPBV7bfgAkO/iSFs4QoqmcedcX0qy+cU= X-Received: by 2002:a2e:a9a2:0:b0:2d5:9703:263f with SMTP id x34-20020a2ea9a2000000b002d59703263fmr6257035ljq.44.1712587467916; Mon, 08 Apr 2024 07:44:27 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Greg Sabino Mullane Date: Mon, 8 Apr 2024 10:43:51 -0400 Message-ID: Subject: Re: Storing and comparing columns of cryptographic hashes? To: Josh Triplett Cc: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000ac0d17061596d727" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000ac0d17061596d727 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Apr 8, 2024 at 10:08=E2=80=AFAM Josh Triplett wrote: > - Is there a way to tell postgresql "this column contains cryptographic > hashes, so you can do hash joins using any subset of the bits, > without having to hash them again"? If not, should there be? > No, and no. (if I understand your question correctly). You could use a functional index, I suppose, but seems premature optimization. > - Is `bit(256)` the right type to use to store 32-byte hash values with n= o > additional overhead? > No, you would want bytea. I would store the value in a TEXT field, unless you really worried about space savings. The hexadecimal value will be far easier to debug and work with, and you can use a simple b-tree index. - What would be the simplest way, given an input array of hashes (which > I may have to pass in as an array and use `unnest`), to filter out all > the values that already exist, *and* generate a corresponding bitmap > in the same order for present/not-present for the entire array (to > send back to the user)? Filtering seems easy enough, but generating > the bitmap less so. > Something like this: SELECT array_agg(case when t.bhash is null then 1 else 0 end) from unnest(array['blakehash1', 'blakehash2', etc...]) as a(x) left join mytable t on t.bhash =3D a.x; > - Does it make more sense to store the values as one row per value, or > as one row per group of values? Hard to answer without knowing more, but I'd lean towards simple and one row per value. Your proposal (query db, do external work, update db) also sets of lots of concurrency red flags, so be mindful of that. Cheers, Greg --000000000000ac0d17061596d727 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Mon, Apr 8, 2024 at 10:08=E2=80=AFAM J= osh Triplett <josh@joshtriplett= .org> wrote:
- Is there a way to tell postgresql "= this column contains cryptographic hashes, so you can do hash joins using a= ny subset of the bits, without=C2=A0having to hash them again"? If not= , should there be?

No, and no. (if I un= derstand=C2=A0your question correctly). You could use a functional index, I= suppose, but seems premature optimization.
=C2=A0
- Is `bit(256)` the right type to use to store 32-byte hash values with no = additional overhead?

No, you would want= bytea. I would store the value in a TEXT field, unless you really worried = about space savings. The hexadecimal value will be far easier to debug and = work with, and you can use a simple b-tree index.

- What would be the simplest w= ay, given an input array of hashes (which
=C2=A0 I may have to pass in as an array and use `unnest`), to filter out a= ll
=C2=A0 the values that already exist, *and* generate a corresponding bitmap=
=C2=A0 in the same order for present/not-present for the entire array (to =C2=A0 send back to the user)? Filtering seems easy enough, but generating<= br> =C2=A0 the bitmap less so.

Something like this:

<= div>SELECT array_agg(case when t.bhash is null then 1 else 0 end)
from unnest(array['blakehash1', 'blakehash2', etc...]) as = a(x)
left join mytable t on t.bhash =3D a.x;
=C2=A0
=
- Does it make more= sense to store the values as one row per value, or
=C2=A0 as one row per group of values?

Hard= to answer without knowing more, but I'd lean towards simple and one ro= w per value.

Your proposal (query db, do external = work, update db) also sets of lots of concurrency red flags, so be mindful = of that.

Cheers,
Greg

--000000000000ac0d17061596d727--