Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: index vs. seq scan choice?
Date: Thu, 24 May 2007 16:15:54 -0700
Message-ID: 
 <8C5B026B51B6854CBE88121DBF097A86C3A30D@ehost010-33.exch010.intermedia.net>
Thread-Topic: index vs. seq scan choice?
Thread-Index: AceeWXpA1aWjKfrUSniY2q7niQL4mA==
From: "George Pavlov" <gpavlov@mynewplace.com>
To: <pgsql-general@postgresql.org>

I am trying to figure out how the distribution of data affects index
usage by the query because I am seeing some behavior that does not seem
optimal to my uneducated eye.=20

I am on PG 8.1.8. I have two tables foo and foo_detail, both have been
vacuum analyzed recently. Both have a property_id column, both have an
index on it. The foo table has a state_code, also indexed, and the
relative share of rows for the two state_codes used in the example below
is:
  PA 	2842	2.80%
  MN 	2858	2.81%

The distribution of distinct property_ids is fairly similar:
  PA 	719	2.90%
  MN 	765	3.09%

A simple query filtered by PA vs. MN produces different results (see
below). The PA query does a Seq Scan, the MN query uses the index and is
>20 times faster. Both return about the same number of rows. I tried it
with all state_codes that have rows in foo and it seems that the cutoff
is somewhere around 3%, but there isn't a direct correlation (there are
state_codes that are < 3% that trigger a Seq Scan and there are ones
above 3% that result in an Index scan).

I am curious what could make the PA query to ignore the index. What are
the specific stats that are being used to make this decision? Would it
perform better if it were to use the index? Anything I can do to "nudge"
it towards using the index, which seems like a rather beneficial thing?

The actual queries:

explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code =3D 'PA'

Merge Join  (cost=3D17842.71..18436.30 rows=3D3347 width=3D4) (actual
time=3D594.538..972.032 rows=3D2842 loops=3D1)
  Merge Cond: ("outer".property_id =3D "inner".property_id)
  ->  Sort  (cost=3D4381.72..4390.09 rows=3D3347 width=3D4) (actual
time=3D14.092..18.497 rows=3D2842 loops=3D1)
        Sort Key: f.property_id
        ->  Bitmap Heap Scan on foo f  (cost=3D22.71..4185.78 =
rows=3D3347
width=3D4) (actual time=3D0.826..7.008 rows=3D2842 loops=3D1)
              Recheck Cond: (state_code =3D 'PA'::bpchar)
              ->  Bitmap Index Scan on mv_search_state
(cost=3D0.00..22.71 rows=3D3347 width=3D0) (actual time=3D0.734..0.734 =
rows=3D2842
loops=3D1)
                    Index Cond: (state_code =3D 'PA'::bpchar)
  ->  Sort  (cost=3D13460.99..13732.84 rows=3D108742 width=3D4) (actual
time=3D580.312..754.012 rows=3D110731 loops=3D1)
        Sort Key: fd.property_id
        ->  Seq Scan on foo_detail fd  (cost=3D0.00..4364.42 =
rows=3D108742
width=3D4) (actual time=3D0.006..210.846 rows=3D108742 loops=3D1)
Total runtime: 991.852 ms

explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code =3D 'MN'

Nested Loop  (cost=3D7.62..8545.85 rows=3D1036 width=3D4) (actual
time=3D0.877..44.196 rows=3D2858 loops=3D1)
  ->  Bitmap Heap Scan on foo f  (cost=3D7.62..2404.44 rows=3D1036 =
width=3D4)
(actual time=3D0.852..6.579 rows=3D2858 loops=3D1)
        Recheck Cond: (state_code =3D 'MN'::bpchar)
        ->  Bitmap Index Scan on mv_search_state  (cost=3D0.00..7.62
rows=3D1036 width=3D0) (actual time=3D0.744..0.744 rows=3D2858 =
loops=3D1)
              Index Cond: (state_code =3D 'MN'::bpchar)
  ->  Index Scan using ix_fd on foo_detail fd  (cost=3D0.00..5.92 =
rows=3D1
width=3D4) (actual time=3D0.005..0.007 rows=3D1 loops=3D2858)
        Index Cond: ("outer".property_id =3D fd.property_id)
Total runtime: 48.439 ms