Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vtRCW-007gkX-0M for pgsql-general@arkaria.postgresql.org; Fri, 20 Feb 2026 14:09:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vtRCU-0085qs-2y for pgsql-general@arkaria.postgresql.org; Fri, 20 Feb 2026 14:09:46 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vslsO-00HEe7-0m for pgsql-general@lists.postgresql.org; Wed, 18 Feb 2026 18:02:16 +0000 Received: from sosiego.soundray.org ([116.203.207.114]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vslsL-00000001Fo1-2iFA for pgsql-general@postgresql.org; Wed, 18 Feb 2026 18:02:15 +0000 From: Linus Heckemann DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=schreibt.jetzt; s=sosiego; t=1771437732; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to; bh=gpYGreVx4rie2eyrBC5FMlPPCTjyMwmD+q+cDUTKnG8=; b=hgbiP7ohNj/U0miNSM7pEGAEDT7xJxro+RoCzkOHp37tIYtH8u9IfZqL6tUaN/iX2324nm 6PPm0dWkGpYIHK+tNgU4qBcNHhaI0sIqUIZ4Jrj+iFrvSZ2xIvVs8IHOP3u4FMxBES1APv 7VIz4EiFDqmIMMVP4b4/SHVkXJk9cGI= To: pgsql-general mailing list Subject: Re: Can we get sha* function over text, that could be used in index? In-Reply-To: CANzqJaD7mjDcXKNcESLJ+oOd0GDZQvCc45hvGc7LD+zW9COKUg@mail.gmail.com Date: Wed, 18 Feb 2026 19:02:10 +0100 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, I'm the one who asked :) Ron Johnson wrote: > On Wed, Feb 18, 2026 at 9:58=E2=80=AFAM hubert depesz lubaczewski > wrote: >=20 > > Hi, > > So, we have various sha* functions. > > > > And I recently got asked about using them as a based for unique index > > over long texts. > > Normally one would do it with md5(text), but the person asking wanted to > > use sha(). >=20 > I think I'd push back, asking them if they really need > cryptographically-secure hashing (which they most probably don't). I indeed don't need cryptographically-secure hashing in this scenario, and I will be using md5. Nevertheless, I think using md5 in any context that isn't "to verify digests produced in the past" is a poor decision, and postgres should make using other hashes just as easy! The specific context of hashing aside, it seems weird to me that: - there is a byte-array representation of text columns, which appears to be independent of database encoding - there doesn't seem to be _any_ sane way to access this. The obvious (to a naive user, like I was) approach, casting to bytea, has exceptionally surprising behaviour: for many text strings, it does exactly what the naive user might hope for, giving back the UTF-8 representation. But multiple distinct text strings, like '\033' and '\x1b', convert to the same byte string! And text strings containing a backslash that doesn't fit the bytea hex format or the bytea escape format will fail to convert completely! The fact that convert_to() is only stable and not immutable makes sense to me given the effect that configuration can have on its behaviour, but given that there does appear to be a trivially-accessible UTF-8 representation (as used by md5()) I think there should be an immutable function that provides access to it? Is there a good reason not to? I'd be willing to send a patch for it myself. Linus