Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bFwNK-00033D-Pb for pgsql-performance@arkaria.postgresql.org; Thu, 23 Jun 2016 04:36:23 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1bFwNI-0002Yo-Te for pgsql-performance@arkaria.postgresql.org; Thu, 23 Jun 2016 04:36:20 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1bFwNH-0002YD-NK for pgsql-performance@postgresql.org; Thu, 23 Jun 2016 04:36:19 +0000 Received: from mail-yw0-x22c.google.com ([2607:f8b0:4002:c05::22c]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1bFwNC-0002It-7k for pgsql-performance@postgresql.org; Thu, 23 Jun 2016 04:36:18 +0000 Received: by mail-yw0-x22c.google.com with SMTP id l125so61007526ywb.2 for ; Wed, 22 Jun 2016 21:36:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=emolecules.com; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=6/erilsMXCA63HKwTc1VVkKKd8O1kppZhLY7thIfjy0=; b=UUhuBvmE8PZi9DpyIs/RDfwNswfcdt9wFG7d1sC42lB35fN70ps5z+w0OlFvARd9/x YViwiW3FvDjd5ACTwSOiVTwSGV6qNT+4s2Iy9W0eWN+I/9sDPwOiauWMMQ7e6ejTlOZq RW91tO7moB+VRdAJ5KnFkDdBjQ3g3/zGCMUeQ= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=6/erilsMXCA63HKwTc1VVkKKd8O1kppZhLY7thIfjy0=; b=gha+7HODE5nLO3t31wUbhqzOJGw/zMU2XmnZ6G1PMp4HYwVvip9Qil//Ft35neBiHW 3r748lPh1yFWlHnrAWhV3ZEX9ogbi+jTbhg0AWH3Yea2pV3dm01kXqJV93X1z1O2fUT0 s1q3dsPmBxVyxtcbgTBAIz0Wa54ZM0TCm3+acoy9aNY9eJy8lP81EeQcuer36sVrT5sv 5FlJb2yZGPqgwxdutlRq2kPDm5JYI7rpam4/V/bTbrSaT/9ndu4nndqyC4NrlnUe4dKL e56VUFWWZSZ2hzH5PXQhhfAtjAqPbxucZ01fpXGDkHrVWYwENVFjNsqlOEi0W0FTSmde Dg2w== X-Gm-Message-State: ALyK8tIVxobw1hmisGvIUKEGfRPgjADm8Ll+i5fbLb74dy4IUebT20SYA340cUDoH+jrWl5K19hF8ymYQkgrFFHk X-Received: by 10.129.84.193 with SMTP id i184mr17589222ywb.285.1466656571993; Wed, 22 Jun 2016 21:36:11 -0700 (PDT) MIME-Version: 1.0 Received: by 10.129.134.131 with HTTP; Wed, 22 Jun 2016 21:36:11 -0700 (PDT) In-Reply-To: References: From: Craig James Date: Wed, 22 Jun 2016 21:36:11 -0700 Message-ID: Subject: Re: Can't get two index scans To: Jeff Janes Cc: "pgsql-performance@postgresql.org" Content-Type: multipart/alternative; boundary=001a114d9a50fb63b80535ea9790 X-Pg-Spam-Score: -2.7 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --001a114d9a50fb63b80535ea9790 Content-Type: text/plain; charset=UTF-8 On Wed, Jun 22, 2016 at 11:36 AM, Jeff Janes wrote: > On Wed, Jun 22, 2016 at 9:03 AM, Craig James > wrote: > > I'm working with a third-party plugin that does chemistry. > > > Out of personal/professional curiosity, which one are you using, if > that can be disclosed? > ChemAxon (JChem) > > Notice that it doesn't use the i_version_smiles index at all, but instead > > applies the very expensive filter |>| to all 1% of the database. > > You have to tell the database that |>| is very expensive, by setting > the COST of the function which it invokes. You can get the name of > the function with: > > select oprcode from pg_operator where oprname ='|>|' ; > > (taking care for schema and overloading, etc.) > > I would set the COST to at least 1000, probably more. > I'll try this. I've done it with my own functions, but didn't realize you could do it with existing operators. > > So instead > > of getting a 100x speedup, we only get a 3x speedup, about 30x worse that > > what is theoretically possible. > > > > The production database is about 50x larger than this test database. > > > > Maybe I misunderstand what's possible with indexes, but it seems to me > that > > it could first do the pk_version index scan, and then use the results of > > that to do a limited index-scan search using the i_version_smiles index. > Is > > that not possible? > > I don't think it can do that. What it can do is run each index scan > to completion as a bitmap index scan, and then AND the bitmaps > together. > That won't help in this case because the index scan of the molecule table can be slow. > > You might be able to build a multiple column index on (smiles, > version_id) and have it do the right thing automatically. Whether that > is possible, and if so how effective it will actually be, would depend > on the implementation details of |>|. My gut feeling is that it would > not work well. > No, because it's not a normal exact-match query. The analogy would be that you can build a multi-column index for an '=' operation on a string, but it wouldn't help if you were doing an '~' or 'LIKE' operation. > You could partition your data on version_id. Then it would keep a > separate smiles index on each partition, and would only consult those > indexes which can possibly contain (according to the CHECK > constraints) the version_ids of interest in the query. > I actually struck on this solution today and it works well. Instead partitioning on the version_id, I added a column "p" ("partition") and used 20 partitions where p is a random number from 0..19. This has the advantage that as new compounds are added, they are distributed throughout the partitions, so each partition remains a 5% sample of the whole. It's pretty cool. A full-table scan of all partitions is slightly slower, but if I want to do a sample and limit the run time, I can query with p = 0. It also has another huge benefit for a web site: I can give the user a progress-bar widget by querying the partitions one-by-one and updating the progress in 5% increments. This is really critical for long-running queries. > Also, if you tune your system using benzene, you will be probably > arrive at a place not optimal for more realistic queries. > No, it's actually very useful. I'm not interested in optimizing typical queries, but rather in limiting worst-case queries. This is a public web site, and you never know what molecule someone will draw. In fact, it's quite common for visitors to draw silly molecules like benzine or methane that would result in a heavy load if left to run to completion. Thanks for your help! Craig > Cheers, > > Jeff > -- --------------------------------- Craig A. James Chief Technology Officer eMolecules, Inc. --------------------------------- --001a114d9a50fb63b80535ea9790 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
On W= ed, Jun 22, 2016 at 11:36 AM, Jeff Janes <jeff.janes@gmail.com><= /span> wrote:
On Wed, Jun 22, 2016 at 9:0= 3 AM, Craig James <cjames@emole= cules.com> wrote:
> I'm working with a third-party plugin that does chemistry.


Out of personal/professional curiosity, which one are you using, if
that can be disclosed?

ChemAxon (JChem)=
=C2=A0
> Notice that it d= oesn't use the i_version_smiles index at all, but instead
> applies the very expensive filter |>| to all 1% of the database.
You have to tell the database that |>| is very expensive, by setting
the COST of the function which it invokes.=C2=A0 You can get the name of the function with:

select oprcode from pg_operator where oprname =3D'|>|' ;

(taking care for schema and overloading, etc.)

I would set the COST to at least 1000, probably more.
=
I'll try this. I've done it with my own functions, b= ut didn't realize you could do it with existing operators.

> So instead
> of getting a 100x speedup, we only get a 3x speedup, about 30x worse t= hat
> what is theoretically possible.
>
> The production database is about 50x larger than this test database. >
> Maybe I misunderstand what's possible with indexes, but it seems t= o me that
> it could first do the pk_version index scan, and then use the results = of
> that to do a limited index-scan search using the i_version_smiles inde= x. Is
> that not possible?

I don't think it can do that.=C2=A0 What it can do is run each index sc= an
to completion as a bitmap index scan, and then AND the bitmaps
together.

That won't help in this c= ase because the index scan of the molecule table can be slow.
=C2= =A0

You might be able to build a multiple column index on (smiles,
version_id) and have it do the right thing automatically. Whether that
is possible, and if so how effective it will actually be, would depend
on the implementation details of |>|. My gut feeling is that it would not work well.

No, because it's not= a normal exact-match query. The analogy would be that you can build a mult= i-column index for an '=3D' operation on a string, but it wouldn= 9;t help if you were doing an '~' or 'LIKE' operation.
=C2=A0
You could partition your d= ata on version_id.=C2=A0 Then it would keep a
separate smiles index on each partition, and would only consult those
indexes which can possibly contain (according to the CHECK
constraints) the version_ids of interest in the query.

I actually struck on this solution today and it works well.= Instead partitioning on the version_id, I added a column "p" (&q= uot;partition") and used 20 partitions where p is a random number from= 0..19. This has the advantage that as new compounds are added, they are di= stributed throughout the partitions, so each partition remains a 5% sample = of the whole.

It's pretty cool. A full-table s= can of all partitions is slightly slower, but if I want to do a sample and = limit the run time, I can query with p =3D 0.

It a= lso has another huge benefit for a web site: I can give the user a progress= -bar widget by querying the partitions one-by-one and updating the progress= in 5% increments. This is really critical for long-running queries.=C2=A0<= /div>


Also, if you tune your system using benzene, you will be probably
arrive at a place not optimal for more realistic queries.
<= div>
No, it's actually very useful. I'm not intereste= d in optimizing typical queries, but rather in limiting worst-case queries.= This is a public web site, and you never know what molecule someone will d= raw. In fact, it's quite common for visitors to draw silly molecules li= ke benzine or methane that would result in a heavy load if left to run to c= ompletion.

Thanks for your help!
Craig


Cheers,

Jeff



--
---------------------------------
Craig A. James
C= hief Technology Officer
eMolecules, Inc.
----------------= -----------------
--001a114d9a50fb63b80535ea9790--