Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vuXfc-00CRoI-1F for pgsql-general@arkaria.postgresql.org; Mon, 23 Feb 2026 15:16:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vuXfb-00DljI-0y for pgsql-general@arkaria.postgresql.org; Mon, 23 Feb 2026 15:16:23 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vtjhT-009f1W-0A for pgsql-general@lists.postgresql.org; Sat, 21 Feb 2026 09:54:59 +0000 Received: from sosiego.soundray.org ([116.203.207.114]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vtjhP-00000000XvT-3Sde for pgsql-general@postgresql.org; Sat, 21 Feb 2026 09:54:58 +0000 From: Linus Heckemann DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=schreibt.jetzt; s=sosiego; t=1771667694; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=RhjyubLaY78Pqaclubo8I3VHpSLIw0/QYRF1WcmXeno=; b=OXYMMxbN/VPhhhFOeNTPhZw19qYOLT2jyT3ySW9Ntmai5AqPHYv08+qhmS9knsoFajHLuR KpcZhBDxw9qW9flYDi04WnBd43L7rYxh1SWWTtGhWpuh7bSK9zCOtFEiE7G/lJ9Vm8vgMi dSOgC0pqqfpMq2nMpyqvu5QbeB0UrBs= To: Daniel Verite Cc: pgsql-general mailing list Subject: Re: Can we get sha* function over text, that could be used in index? In-Reply-To: <63a9ec31-8a5f-47c7-8a70-f1c57207ffd3@manitou-mail.org> References: <63a9ec31-8a5f-47c7-8a70-f1c57207ffd3@manitou-mail.org> Date: Sat, 21 Feb 2026 10:54:53 +0100 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk "Daniel Verite" writes: >> - there is a byte-array representation of text columns, which appears to >> be independent of database encoding > > Not sure what you're refering to. Both the on-disk and in-memory > representations of text/varchar are encoding-dependent. Huh, I thought I'd come to the opposite conclusion from testing the other day (md5('=C3=BF') giving the same results for both a utf-8 database and a latin1 database), but I must have been holding something wrong, because trying again does confirm this. Sorry about that! >> The obvious (to a naive user, like I was) approach, casting to bytea, >> has exceptionally surprising behaviour: for many text strings, it does >> exactly what the naive user might hope for, giving back the UTF-8 >> representation. But multiple distinct text strings, like '\033' and >> '\x1b', convert to the same byte string! And text strings containing a >> backslash that doesn't fit the bytea hex format or the bytea escape >> format will fail to convert completely! > > Yes. It seems a common mistake to forget or ignore that > backslashes are special in the input text representation of bytea. > It might be not obvious from reading the doc at [1] > but we just need to quote backslashes by doubling them. > > AFAIK a working solution for the OP would be: > sha256(replace(colname, '\', '\\')::bytea) That's so simple that it feels like it should have been obvious... Thanks! Nevertheless, it feels like a weird and potentially computationally expensive contortion for getting from a text value to its bytea representation. Given their encoding-dependence, shouldn't md5() (and ::bytea, for that matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's > guaranteed to return the same results given the same arguments forever only valid within the context of a single database? md5() and ::bytea aside, it seems like it would be good to have both - a function for accessing the bytea representation of a text value without any extra steps, and - an IMMUTABLE function for getting a UTF-8 (specifically UTF-8, because it's the only encoding with a character set encompassing all other supported character sets) bytea representation of a text value regardless of the database encoding? Cheers Linus