Re: Indexes for hashes - ktm@rice.edu

public inbox for [email protected]  
help / color / mirror / Atom feed

From: [email protected] <[email protected]>
To: Ivan Voras <[email protected]>
Cc: postgres performance list <[email protected]>
Subject: Re: Indexes for hashes
Date: Wed, 15 Jun 2016 08:16:14 -0500
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAF-QHFVq9XFAqA2GhT15btmvQBErNjLpaChPrx3BGmTRpu320w@mail.gmail.com>
References: <CAF-QHFULmVOdrqwtR7AKRnnx6=GbAW7S6v6f4jACEOVENef7NA@mail.gmail.com>
	<[email protected]>
	<CAF-QHFVq9XFAqA2GhT15btmvQBErNjLpaChPrx3BGmTRpu320w@mail.gmail.com>
List-Unsubscribe:  <mailto:[email protected]?body=unsub%20pgsql-performance>

On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
> On 15 June 2016 at 15:03, [email protected] <[email protected]> wrote:
> 
> > On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> > > Hi,
> > >
> > > I have an application which stores a large amounts of hex-encoded hash
> > > strings (nearly 100 GB of them), which means:
> > >
> > >    - The number of distinct characters (alphabet) is limited to 16
> > >    - Each string is of the same length, 64 characters
> > >    - The strings are essentially random
> > >
> > > Creating a B-Tree index on this results in the index size being larger
> > than
> > > the table itself, and there are disk space constraints.
> > >
> > > I've found the SP-GIST radix tree index, and thought it could be a good
> > > match for the data because of the above constraints. An attempt to create
> > > it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes
> > more
> > > than 12 hours (while a similar B-tree index takes a few hours at most),
> > so
> > > I've interrupted it because "it probably is not going to finish in a
> > > reasonable time". Some slides I found on the spgist index allude that
> > both
> > > build time and size are not really suitable for this purpose.
> > >
> > > My question is: what would be the most size-efficient index for this
> > > situation?
> >
> > Hi Ivan,
> >
> > If the strings are really random, then maybe a function index on the first
> > 4, 8, or 16 characters could be used to narrow the search space and not
> > need
> > to index all 64. If they are not "good" random numbers, you could use a
> > hash
> > index on the strings. It will be much smaller since it currently uses a
> > 32-bit
> > hash. It has a number of caveats and is not currently crash-safe, but it
> > seems
> > like it might work in your environment. You can also use a functional
> > index on
> > a hash-function applied to your values with a btree to give you crash
> > safety.
> >
> >
> Hi,
> 
> I figured the hash index might be helpful and I've tried it in the
> meantime: on one of the smaller tables (which is 51 GB in size), a btree
> index is 32 GB, while the hash index is 22 GB (so btree is around 45%
> larger).
> 
> I don't suppose there's an effort in progress to make hash indexes use WAL?
> :D

Hi Ivan,

Several people have looked at it but it has not made it to the top of anyone's
to-do list. So if you need WAL and crash-safety, a functional index on a hash
of your values is currently your best bet.

Regards,
Ken


-- 
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Indexes for hashes
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox