Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bDAhr-0004xE-JB for pgsql-performance@arkaria.postgresql.org; Wed, 15 Jun 2016 13:18:07 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1bDAhq-0001E0-In for pgsql-performance@arkaria.postgresql.org; Wed, 15 Jun 2016 13:18:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1bDAgB-0007n5-1c for pgsql-performance@postgresql.org; Wed, 15 Jun 2016 13:16:23 +0000 Received: from proofpoint1.mail.rice.edu ([128.42.201.100] helo=pp1.rice.edu) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1bDAg5-0001wm-Or for pgsql-performance@postgresql.org; Wed, 15 Jun 2016 13:16:22 +0000 Received: from pps.filterd (pp1.rice.edu [127.0.0.1]) by pp1.rice.edu (8.15.0.59/8.15.0.59) with SMTP id u5FDEOwq022652; Wed, 15 Jun 2016 08:16:14 -0500 Received: from mh3.mail.rice.edu (mh3.mail.rice.edu [128.42.199.10]) by pp1.rice.edu with ESMTP id 23k4jt025j-1; Wed, 15 Jun 2016 08:16:14 -0500 X-Virus-Scanned: by amavis-2.7.0 at mh3.mail.rice.edu, auth channel X-SMTP-Auth: no X-SMTP-Auth: no Received: from aart.rice.edu (aart.rice.edu [168.7.56.48]) by mh3.mail.rice.edu (Postfix) with ESMTP id 2D6AB40410; Wed, 15 Jun 2016 08:16:14 -0500 (CDT) Received: by aart.rice.edu (Postfix, from userid 18612) id 2236410084F; Wed, 15 Jun 2016 08:16:14 -0500 (CDT) Date: Wed, 15 Jun 2016 08:16:14 -0500 From: "ktm@rice.edu" To: Ivan Voras Cc: postgres performance list Subject: Re: Indexes for hashes Message-ID: <20160615131614.GM25300@aart.rice.edu> References: <20160615130307.GL25300@aart.rice.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-12-10) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1604210000 definitions=main-1606150144 X-Pg-Spam-Score: -5.6 (-----) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote: > On 15 June 2016 at 15:03, ktm@rice.edu wrote: > > > On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote: > > > Hi, > > > > > > I have an application which stores a large amounts of hex-encoded hash > > > strings (nearly 100 GB of them), which means: > > > > > > - The number of distinct characters (alphabet) is limited to 16 > > > - Each string is of the same length, 64 characters > > > - The strings are essentially random > > > > > > Creating a B-Tree index on this results in the index size being larger > > than > > > the table itself, and there are disk space constraints. > > > > > > I've found the SP-GIST radix tree index, and thought it could be a good > > > match for the data because of the above constraints. An attempt to create > > > it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes > > more > > > than 12 hours (while a similar B-tree index takes a few hours at most), > > so > > > I've interrupted it because "it probably is not going to finish in a > > > reasonable time". Some slides I found on the spgist index allude that > > both > > > build time and size are not really suitable for this purpose. > > > > > > My question is: what would be the most size-efficient index for this > > > situation? > > > > Hi Ivan, > > > > If the strings are really random, then maybe a function index on the first > > 4, 8, or 16 characters could be used to narrow the search space and not > > need > > to index all 64. If they are not "good" random numbers, you could use a > > hash > > index on the strings. It will be much smaller since it currently uses a > > 32-bit > > hash. It has a number of caveats and is not currently crash-safe, but it > > seems > > like it might work in your environment. You can also use a functional > > index on > > a hash-function applied to your values with a btree to give you crash > > safety. > > > > > Hi, > > I figured the hash index might be helpful and I've tried it in the > meantime: on one of the smaller tables (which is 51 GB in size), a btree > index is 32 GB, while the hash index is 22 GB (so btree is around 45% > larger). > > I don't suppose there's an effort in progress to make hash indexes use WAL? > :D Hi Ivan, Several people have looked at it but it has not made it to the top of anyone's to-do list. So if you need WAL and crash-safety, a functional index on a hash of your values is currently your best bet. Regards, Ken -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance