Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bG6sz-0004TE-3o for pgsql-performance@arkaria.postgresql.org; Thu, 23 Jun 2016 15:49:45 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1bG6sy-0007gT-I8 for pgsql-performance@arkaria.postgresql.org; Thu, 23 Jun 2016 15:49:44 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1bG6rI-0005r7-OC for pgsql-performance@postgresql.org; Thu, 23 Jun 2016 15:48:00 +0000 Received: from mail-vk0-x232.google.com ([2607:f8b0:400c:c05::232]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1bG6rG-0008ED-4E for pgsql-performance@postgresql.org; Thu, 23 Jun 2016 15:47:59 +0000 Received: by mail-vk0-x232.google.com with SMTP id j2so112132836vkg.2 for ; Thu, 23 Jun 2016 08:47:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=nA5UwzL0VHVFQupBvB7abD3ADBGfjQ9I2eoor58GjLs=; b=OySbsg+JwITFY1edx1GTDLbXeUzmvaEh1+o+gfBbIkD4Lsxem36b6+JYluiG7dTwIc bsCzvUN+FrUnf6b7kT2O9fxIdRNAKMDWG+ptnK0PIebsNU4MhH7pBuIK5SRl05trMSGK ySpkZ46eH2tqiXKTo0CBj+hvdTgWonMROpcLtpN5Q4bnwvADCkbhOI7zt9XmNwwKt4gX hp+Ryi/KqRMLdPCZFegT4PesrBSaOfEKLqIkhWYVUsxL5XS2L8weIozMRi0TK4KhJbmh PfFHJkBdFBfZjMmwytLHpKCprHhkWfxtFCmEpOo/oiVDaoQ1BK7a5rqsC1GEn6Wkylve LZLg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=nA5UwzL0VHVFQupBvB7abD3ADBGfjQ9I2eoor58GjLs=; b=Sfnc2PhFl6DSgQwzreHvYFwGSXPwINsQ1tvJYRtGY/Z82qu3IQv0rhQhP4+sSE1M41 RZ/XAacFlK399k+10Ft7NnM75DgBTLQXb2b6oCNG+8fmaAp6suGYtIN4+TxaSCO1Aeh1 6e+khm5J04xwiHGy0nJFzvjOwq+FlYyTGglY+1FOYXA4GjtJrRX9zsrJivOIqYdEYwl0 ijMOVriW0FCioRMey2D6R8U2OMV0FwP/BckiXORioQBi9FzlL+CoVVvQ5uw8wGz+0c+Y bdExRPec+YhnOt6omErxrG+YfmojkQWP5SJyI42Ogf2LEftfDtglcfq/Xg4f+SKRqeKi BXZg== X-Gm-Message-State: ALyK8tLqQ6TdLc3JEUfZemTLP4P0N8RBGZrIdEeZ1EeB9z72vlG3qVXNTdfyouMb2Leo6VSEgUhHamoD659mFw== X-Received: by 10.31.87.129 with SMTP id l123mr16714208vkb.22.1466696876791; Thu, 23 Jun 2016 08:47:56 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.131.204 with HTTP; Thu, 23 Jun 2016 08:47:56 -0700 (PDT) In-Reply-To: References: From: Jeff Janes Date: Thu, 23 Jun 2016 08:47:56 -0700 Message-ID: Subject: Re: Can't get two index scans To: Craig James Cc: "pgsql-performance@postgresql.org" Content-Type: text/plain; charset=UTF-8 X-Pg-Spam-Score: -2.7 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Wed, Jun 22, 2016 at 9:36 PM, Craig James wrote: > On Wed, Jun 22, 2016 at 11:36 AM, Jeff Janes wrote: >> You might be able to build a multiple column index on (smiles, >> version_id) and have it do the right thing automatically. Whether that >> is possible, and if so how effective it will actually be, would depend >> on the implementation details of |>|. My gut feeling is that it would >> not work well. > > > No, because it's not a normal exact-match query. The analogy would be that > you can build a multi-column index for an '=' operation on a string, but it > wouldn't help if you were doing an '~' or 'LIKE' operation. That restriction only applies to BTREE indexes. GiST and GIN indexes work differently, and don't have that particular limitation. They can use the second column of the index even if the first column is not used, or (in the case of GiST at least) the first column is used with an operator other than equality. The main problems I've run into with GiST indexes is that they sometimes take absurdly long times to build; and that the split-picking algorithm might arrive at buckets ill-suited to your queries so that the consultation of the index "works" in the sense that it discards most of the non-matching rows without inspecting them, but isn't actually faster. Unfortunately, both of these problems seem hard to predict. You pretty much have to try it (on a full-size data set, as scaling up from toy data sets is also hard to predict) and see how it does. But, JChem's cartridge is apparently not using a GiST index, which is what my first guess was. I can't really figure out what PostgreSQL API it is tapping into, so whatever it is very well might not support multi-column indexes at all. >> You could partition your data on version_id. Then it would keep a >> separate smiles index on each partition, and would only consult those >> indexes which can possibly contain (according to the CHECK >> constraints) the version_ids of interest in the query. > > > I actually struck on this solution today and it works well. Instead > partitioning on the version_id, I added a column "p" ("partition") and used > 20 partitions where p is a random number from 0..19. This has the advantage > that as new compounds are added, they are distributed throughout the > partitions, so each partition remains a 5% sample of the whole. > > It's pretty cool. A full-table scan of all partitions is slightly slower, > but if I want to do a sample and limit the run time, I can query with p = 0. > > It also has another huge benefit for a web site: I can give the user a > progress-bar widget by querying the partitions one-by-one and updating the > progress in 5% increments. This is really critical for long-running queries. That does sound pretty useful. You could potentially get the same benefit with the multicolumn GiST index, without needing to partition the table. In a vague hand-wavy way, building an index "USING GIST (p, smiles jchem_op_class)" is like using p to automatically partition the index so it acts like individual indexes over smiles for each value of p. But it is unlikely to ever be as efficient as well-crafted explicit partitions, and once you have gone to the effort of setting them up there would probably be no point in trying to change over. >> Also, if you tune your system using benzene, you will be probably >> arrive at a place not optimal for more realistic queries. > > > No, it's actually very useful. I'm not interested in optimizing typical > queries, but rather in limiting worst-case queries. This is a public web > site, and you never know what molecule someone will draw. In fact, it's > quite common for visitors to draw silly molecules like benzine or methane > that would result in a heavy load if left to run to completion. My benefit in having a non-public web site, is that I can just walk over to their desk and yell at the people who do things like that to my database. (And I promise to stop searching for methane on your web site.) Cheers, Jeff -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance