public inbox for [email protected]  
help / color / mirror / Atom feed
From: bricklen <[email protected]>
To: Rob Imig <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Performant queries on table with many boolean columns
Date: Sun, 24 Apr 2016 13:10:46 -0700
Message-ID: <CAGrpgQ9PR8zh2Hdnoa3VnThsbAYo4GA1khbbo59+t9ddAwMqTg@mail.gmail.com> (raw)
In-Reply-To: <CANcrS5oP6+CRucJD9R59E2V9NoY4ZrAbW91pxj2eoivGfsz_ng@mail.gmail.com>
References: <CANcrS5pR1P1Tj=e-RQQ=FF3WPAy_fyruS0YJer-+iJHxR1JAiA@mail.gmail.com>
	<[email protected]>
	<CAMkU=1xXXANTYp7YLQjjzvgYCFQsr+y6iTphx1Q4y8iP7QpRjw@mail.gmail.com>
	<CANcrS5oHanMwESO0jaqROV44wQRNO6y_gXj8C4=uY27t3u7TfQ@mail.gmail.com>
	<CANcrS5oP6+CRucJD9R59E2V9NoY4ZrAbW91pxj2eoivGfsz_ng@mail.gmail.com>
List-Unsubscribe:  <mailto:[email protected]?body=unsub%20pgsql-performance>

On Fri, Apr 22, 2016 at 6:57 AM, Rob Imig <[email protected]> wrote:

> Just to followup where I'm at, I've constructed a new column which is a
> 100 bit bitstring representing all the flags. Created a b-tree index on
> that column and can now do super fast lookups (2) for specific scenarios
> however getting the behavior I need would require a huge amount of OR
> conditions (as Rick mentioned earlier). Another option is to do bitwiser
> operators (3) but that seems really slow. Not sure how I can speed that up.
>

I tried a slightly different tact - how about creating a function-based
md5() index over your columns and doing the same for you input values? For
the test I ran, I used a char datatype with two possible values: '1' (true)
and '0' (false).
The columns were named (for simplicity), c1 to c100.

eg.
create index lots_of_columns_md5_idx on lots_of_columns (
md5(c1||c2||c3||c4||c5||c6||c7||c8||c9||c10||
c11||c12||c13||c14||c15||c16||c17||c18||c19||c20||
c21||c22||c23||c24||c25||c26||c27||c28||c29||c30||
c31||c32||c33||c34||c35||c36||c37||c38||c39||c40||
c41||c42||c43||c44||c45||c46||c47||c48||c49||c50||
c51||c52||c53||c54||c55||c56||c57||c58||c59||c60||
c61||c62||c63||c64||c65||c66||c67||c68||c69||c70||
c71||c72||c73||c74||c75||c76||c77||c78||c79||c80||
c81||c82||c83||c84||c85||c86||c87||c88||c89||c90||
c91||c92||c93||c94||c95||c96||c97||c98||c99||c100)
) with (fillfactor=100);

The query then looked like:
select ...
from ...
where md5(all||the||columns) = md5(all||your||values);

The test data I fabricated wasn't necessarily 85% true as you expect your
data to be, but the tests I ran were returning results in single-digit
milliseconds for a 1M row table. The queries become a bit more difficult to
create as you need to concatenate all the values together. You could pass
the list of columns into a function to abstract that away from the query,
but that might mess with the planner.
Note that the method suggested here relies on column ordering always being
the same, otherwise the hash will be different/inaccurate.


reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Performant queries on table with many boolean columns
  In-Reply-To: <CAGrpgQ9PR8zh2Hdnoa3VnThsbAYo4GA1khbbo59+t9ddAwMqTg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox