Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1auQOS-0005H8-H6 for pgsql-performance@arkaria.postgresql.org; Sun, 24 Apr 2016 20:12:36 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1auQOS-0004TC-2k for pgsql-performance@arkaria.postgresql.org; Sun, 24 Apr 2016 20:12:36 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1auQMm-0002bD-C9 for pgsql-performance@postgresql.org; Sun, 24 Apr 2016 20:10:52 +0000 Received: from mail-yw0-x232.google.com ([2607:f8b0:4002:c05::232]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1auQMi-0005GJ-NU for pgsql-performance@postgresql.org; Sun, 24 Apr 2016 20:10:51 +0000 Received: by mail-yw0-x232.google.com with SMTP id g133so151761309ywb.2 for ; Sun, 24 Apr 2016 13:10:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=mZE+usf0OHNmRl1hn7+QaW7NWcBVOq2uPrvEdfDGTG0=; b=VfVKfkiJrTTAqhETZttbjtIgt9+YmeESS172QetHhXdY0ra3qyWKWE/wDJJCc5EC1b xSAYrv42u0fklu5UdO2GIP13QzYXKPnp2g40ApHoxzpdC9gvbCB9ctY+HGimXPm0wSQb 243sEaFf4KVrK98MGhIWBcdhOd6MuiROSpY1WP3jgGdk0M4X4EqPxQnL3vWeeZOR829n fzur8D8qPoYZb+px6u+QnNwAd8dRmJjr/lp+J1vCJAzNDhYGdEjYKLC+mKKLB6xaqPj3 LR2HmITwpgC2vy6Qvyueuqii1BEo2PbOq++sgZwtbAhiwZUy0/Y2U8FSgrHqWyeJkhX6 UYBQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=mZE+usf0OHNmRl1hn7+QaW7NWcBVOq2uPrvEdfDGTG0=; b=h8M+rkXCpl0S37W7Zibzp4GTU5pOhhDsma6PtcuK67NtQe+xb2/qw6OsbJ+5192/eg dbXD4O2ES4LIVzIqjsKz9kJ21q9QR6W6fddGn8ggYj2akjwRBJ/Hb1Z3JKHPiKxYk4xm BC+OqwzHI0odhKABEZ7W2g8Py7WwWpmEoz2w4npmVRZ5ufZxUdWWVmnx5xQnYjMrHxWE W6N9jJJQFzBdaIiSZICfwQlMO1SwMJcCnB+bWiveMpEsVVq75r6ws9+5bl5ZkJQt2fOf 6/vHRbEENow1eSsbLolWeOXI1PQ8skYwVz7J510N8jsOZYsw3OjvIOMx48F6TWPjizce j9Dw== X-Gm-Message-State: AOPr4FWkpT6uB9o7rJEi9aoiqJvaE7RKPGKuNgYSsEbxkjrHV+Gn7118bMClt0pzDNPgoVL1ohuvdpUeyzV8IA== MIME-Version: 1.0 X-Received: by 10.129.77.2 with SMTP id a2mr17852566ywb.151.1461528646573; Sun, 24 Apr 2016 13:10:46 -0700 (PDT) Received: by 10.37.209.212 with HTTP; Sun, 24 Apr 2016 13:10:46 -0700 (PDT) In-Reply-To: References: <5717D07E.3070802@sigaev.ru> Date: Sun, 24 Apr 2016 13:10:46 -0700 Message-ID: Subject: Re: Performant queries on table with many boolean columns From: bricklen To: Rob Imig Cc: "pgsql-performance@postgresql.org" Content-Type: multipart/alternative; boundary=001a1140a026cee750053140a779 X-Pg-Spam-Score: -2.7 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --001a1140a026cee750053140a779 Content-Type: text/plain; charset=UTF-8 On Fri, Apr 22, 2016 at 6:57 AM, Rob Imig wrote: > Just to followup where I'm at, I've constructed a new column which is a > 100 bit bitstring representing all the flags. Created a b-tree index on > that column and can now do super fast lookups (2) for specific scenarios > however getting the behavior I need would require a huge amount of OR > conditions (as Rick mentioned earlier). Another option is to do bitwiser > operators (3) but that seems really slow. Not sure how I can speed that up. > I tried a slightly different tact - how about creating a function-based md5() index over your columns and doing the same for you input values? For the test I ran, I used a char datatype with two possible values: '1' (true) and '0' (false). The columns were named (for simplicity), c1 to c100. eg. create index lots_of_columns_md5_idx on lots_of_columns ( md5(c1||c2||c3||c4||c5||c6||c7||c8||c9||c10|| c11||c12||c13||c14||c15||c16||c17||c18||c19||c20|| c21||c22||c23||c24||c25||c26||c27||c28||c29||c30|| c31||c32||c33||c34||c35||c36||c37||c38||c39||c40|| c41||c42||c43||c44||c45||c46||c47||c48||c49||c50|| c51||c52||c53||c54||c55||c56||c57||c58||c59||c60|| c61||c62||c63||c64||c65||c66||c67||c68||c69||c70|| c71||c72||c73||c74||c75||c76||c77||c78||c79||c80|| c81||c82||c83||c84||c85||c86||c87||c88||c89||c90|| c91||c92||c93||c94||c95||c96||c97||c98||c99||c100) ) with (fillfactor=100); The query then looked like: select ... from ... where md5(all||the||columns) = md5(all||your||values); The test data I fabricated wasn't necessarily 85% true as you expect your data to be, but the tests I ran were returning results in single-digit milliseconds for a 1M row table. The queries become a bit more difficult to create as you need to concatenate all the values together. You could pass the list of columns into a function to abstract that away from the query, but that might mess with the planner. Note that the method suggested here relies on column ordering always being the same, otherwise the hash will be different/inaccurate. --001a1140a026cee750053140a779 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


On Fri, Apr 22, 2016 at 6:57 AM, Rob Imig <rimig88@gmail.com> wrote:
Just to followup where I'm at, I've constructed a new colu= mn which is a 100 bit bitstring representing all the flags. Created a b-tre= e index on that column and can now do super fast lookups (2) for specific s= cenarios however getting the behavior I need would require a huge amount of= OR conditions (as Rick mentioned earlier). Another option is to do bitwise= r operators (3) but that seems really slow. Not sure how I can speed that u= p.

I tried a slightly different t= act - how about creating a function-based md5() index over your columns and= doing the same for you input values? For the test I ran, I used a char dat= atype with two possible values: '1' (true) and '0' (false).=
The columns were named (for simplicity), c1 to c100.

eg.
create index lots_of_columns_md5_idx on lots_of_columns (
md5(c1= ||c2||c3||c4||c5||c6||c7||c8||c9||c10||
c11||c12||c13||c14||c15||c16||c1= 7||c18||c19||c20||
c21||c22||c23||c24||c25||c26||c27||c28||c29||c30||c31||c32||c33||c34||c35||c36||c37||c38||c39||c40||
c41||c42||c43||c44||= c45||c46||c47||c48||c49||c50||
c51||c52||c53||c54||c55||c56||c57||c58||c= 59||c60||
c61||c62||c63||c64||c65||c66||c67||c68||c69||c70||
c71||c72= ||c73||c74||c75||c76||c77||c78||c79||c80||
c81||c82||c83||c84||c85||c86|= |c87||c88||c89||c90||
c91||c92||c93||c94||c95||c96||c97||c98||c99||c100)=
) with (fillfactor=3D100);

The query then = looked like:
select ...
from ...
where md5(all||the||columns) =3D = md5(all||your||values);

The test data I fabricated wasn&#= 39;t necessarily 85% true as you expect your data to be, but the tests I ra= n were returning results in single-digit milliseconds for a 1M row table. T= he queries become a bit more difficult to create as you need to concatenate= all the values together. You could pass the list of columns into a functio= n to abstract that away from the query, but that might mess with the planne= r.
Note that the method suggested here relies on column order= ing always being the same, otherwise the hash will be different/inaccurate.=
=C2=A0
--001a1140a026cee750053140a779--