Received: from localhost (maia-3.hub.org [200.46.204.184]) by postgresql.org (Postfix) with ESMTP id 098589FC151 for ; Thu, 7 Jun 2007 18:56:14 -0300 (ADT) Received: from postgresql.org ([200.46.204.71]) by localhost (mx1.hub.org [200.46.204.184]) (amavisd-maia, port 10024) with ESMTP id 20634-06 for ; Thu, 7 Jun 2007 18:56:04 -0300 (ADT) X-Greylist: from auto-whitelisted by SQLgrey-1.7.5 Received: from mis010.exch010.intermedia.net (mis010.exch010.intermedia.net [64.78.61.97]) by postgresql.org (Postfix) with ESMTP id 41CCE9FC17B for ; Thu, 7 Jun 2007 18:56:10 -0300 (ADT) Received: from ehost010-33.exch010.intermedia.net ([64.78.20.173]) by mis010.exch010.intermedia.net with Microsoft SMTPSVC(6.0.3790.1830); Thu, 7 Jun 2007 14:51:38 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Subject: Re: index vs. seq scan choice? Date: Thu, 7 Jun 2007 14:56:06 -0700 Message-ID: <8C5B026B51B6854CBE88121DBF097A86DEA6B4@ehost010-33.exch010.intermedia.net> In-Reply-To: <27828.1180055291@sss.pgh.pa.us> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [GENERAL] index vs. seq scan choice? Thread-Index: AceeaKh+auyD/ZrQRj2HarXof3NSDAK4szhQ References: <8C5B026B51B6854CBE88121DBF097A86C3A30D@ehost010-33.exch010.intermedia.net> <27828.1180055291@sss.pgh.pa.us> From: "George Pavlov" To: "Tom Lane" , X-OriginalArrivalTime: 07 Jun 2007 21:51:38.0965 (UTC) FILETIME=[06C7D050:01C7A94E] X-Virus-Scanned: Maia Mailguard 1.0.1 X-Archive-Number: 200706/543 X-Sequence-Number: 115366 > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]=20 > "George Pavlov" writes: > > I am curious what could make the PA query to ignore the=20 > index. What are > > the specific stats that are being used to make this decision? >=20 > you don't have the column's statistics target set high enough to > track all the interesting values --- or maybe just not high enough to > acquire sufficiently accurate frequency estimates for them. =20 > Take a look at the pg_stats row for the column ... >=20 > (The default statistics target is 10, which is widely considered too > low --- you might find 100 more suitable.) Well, it seems that it would be more beneficial for me to set it LOWER than the default 10. I get better performance if the stats are less accurate because then the optimizer seems more likely to choose the index! States that are in pg_stats.most_common_vals most often result in a Seq Scan, whereas ones that are not in it definitely get the Index Scan. For all states, even the largest ones (15% of the data), the Index Scan performs better. So, for example, with SET STATISTICS 10 my benhcmark query in a state like Indiana (2981 rows, ~3% of total) runs in 132ms. If I SET STATISTICS 100, Indiana gets on the most_common_vals list for the column and the query does a Seq Scan and its run time jumps to 977ms! If I go the other way and SET STATISTICS 1 (or 0) I can bring down the list to one entry (setting to 0 seems equivalent and still keeps the one most common entry!?) and I will get the Index scan for all states except for that one most common state. But, of course, I don't want to undermine the whole stats mechanism, I just want the system to use the index that is so helpful and brings runtimes down by a factor of 4-8! What am I missing here? George