MIME-Version: 1.0
From: Ivan Voras <ivoras@gmail.com>
Date: Wed, 15 Jun 2016 11:34:18 +0200
Message-ID: 
 <CAF-QHFULmVOdrqwtR7AKRnnx6=GbAW7S6v6f4jACEOVENef7NA@mail.gmail.com>
Subject: Indexes for hashes
To: postgres performance list <pgsql-performance@postgresql.org>
Content-Type: multipart/alternative; boundary=94eb2c081fbcbe5adf05354dd588
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

--94eb2c081fbcbe5adf05354dd588
Content-Type: text/plain; charset=UTF-8

Hi,

I have an application which stores a large amount of hex-encoded hash
strings (nearly 100 GB of them), which means:

   - The number of distinct characters (alphabet) is limited to 16
   - Each string is of the same length, 64 characters
   - The strings are essentially random
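
To make these three properties concrete, here is a minimal Python sketch.
SHA-256 is only an assumption standing in for the actual hash function
(which is not named here), and hex_hash, describe and the sample input are
hypothetical names used for illustration.

```python
import hashlib

HEX_ALPHABET = set("0123456789abcdef")  # the 16-character alphabet


def hex_hash(payload: bytes) -> str:
    # Assumption: SHA-256 stands in for the real (unspecified) hash function.
    return hashlib.sha256(payload).hexdigest()


def describe(value: str) -> dict:
    # Report the properties listed above for one stored string.
    return {
        "length": len(value),                       # characters as stored
        "alphabet_ok": set(value) <= HEX_ALPHABET,  # only hex digits used
        "raw_bytes": len(bytes.fromhex(value)),     # bytes actually encoded
    }


sample = hex_hash(b"example row")  # hypothetical input
info = describe(sample)
print(info)
```

Each 64-character string carries 32 bytes of underlying data, since every
hex digit encodes 4 bits.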

A B-tree index on this column ends up larger than the table itself, and
disk space is constrained.

I've found the SP-GiST radix tree index, and thought it could be a good
match for the data because of the properties listed above. An attempt to
create it (as in CREATE INDEX ON t USING spgist(field_name)) ran for more
than 12 hours (a similar B-tree index takes a few hours at most), so I
interrupted it, since it was probably not going to finish in a reasonable
time. Some slides I found on SP-GiST suggest that neither its build time
nor its size is really suitable for this purpose.

My question is: what would be the most size-efficient index for this
situation?


--94eb2c081fbcbe5adf05354dd588--