public inbox for [email protected]
help / color / mirror / Atom feedFrom: Manikandan Swaminathan <[email protected]>
To: Tom Lane <[email protected]>
Cc: [email protected]
Subject: Re: Postgres Query Plan using wrong index
Date: Wed, 2 Apr 2025 20:24:08 -0700
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
Thanks Tom.
Since you mentioned the planner not knowing about the correlation between the columns, I’m curious, why doesn’t making a multivariate statistic make a difference?
CREATE STATISTICS col_a_col_b_stats (dependencies) ON col_a, col_b FROM test_table;
ANALYZE test_table;
And the resulting query plan which uses just the index on col_b:
postgres=# explain analyze select min(col_b) from test_table where col_a > 4996;
Result (cost=62.13..62.14 rows=1 width=4) (actual time=536.648..536.649 rows=1 loops=1)
InitPlan 1
-> Limit (cost=0.43..62.13 rows=1 width=4) (actual time=536.641..536.641 rows=1 loops=1)
-> Index Scan using idx_col_a_btree on test_table (cost=0.43..254987.43 rows=4133 width=4) (actual time=536.640..536.640 rows=1 loops=1)
Filter: (col_a > 4996)
Rows Removed by Filter: 9992000
Planning Time: 0.285 ms
Execution Time: 536.681 ms
>
> On Apr 2, 2025, at 5:30 PM, Tom Lane <[email protected]> wrote:
>
> Manikandan Swaminathan <[email protected]> writes:
>> 1) Why is the query currently picking the poorly performing index?
>
> Because the planner thinks that one will be cheaper, as you can see by
> comparing the cost estimates in EXPLAIN. It's wrong, but this is a
> hard problem to estimate well. Especially when the behavior depends
> on a correlation between columns that the planner knows nothing about.
>
>> 2) Why would the index you suggested, (col_b, col_a), perform better than (col_a, col_b)? I would’ve expected the filter on col_a to come first, followed by the aggregate on col_b. In my mind, it needs to find rows matching the col_a condition before calculating the MIN(col_b), and I assumed it would traverse the B-tree accordingly.
>
> The idea the planner is using is "scan the index in order (that is,
> in col_b order) until you find the first row satisfying the other
> constraints (that is, the col_a condition). Then that row's col_b
> value is the correct MIN(), and you can stop." Since it knows nothing
> of the cross-column correlation, its estimate of how many rows it'll
> have to scan through is overly optimistic. But it knows that the
> other way involves scanning a whole lot of the index --- there's no
> chance of stopping early --- so that's estimated as higher-cost.
>
> The index I suggested on (col_b, col_a) is amenable to this same
> plan shape, since col_b is still the major sort column. The
> reason it wins is that the col_a condition can be checked in the
> index without having to visit the heap, thus eliminating a lot of
> random access to the heap.
>
>> 3) Why does the planner choose the better-performing (col_a, col_b) index when the filter is col_a > 5000, but switch to the slower (col_b) index when the filter is not at the edge of the range, like col_a > 4996?
>
> At some point, as less and less of the col_a-major index would need to
> be scanned, there's a crossover in the cost estimates for the two ways
> of doing this. I would not have cared to predict where the crossover
> is, but you evidently found it empirically.
>
>> For reference, here’s the query plan when filtering for col_a > 5000. It uses the correct index on (col_a, col_b).
>
> You would do a lot better to approach this without rigid notions of
> which is the "correct" index. All of the ones we've discussed are
> potentially usable for this query, and they all have different cost
> curves depending on how selective the col_a condition is. Even the
> index on col_b alone could potentially be the best, because it'll be
> smaller than the two-column indexes. So if the col_a condition is
> very unselective then it's (at least in theory) possible that that
> would be the best choice.
>
> regards, tom lane
view thread (4+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: Postgres Query Plan using wrong index
In-Reply-To: <[email protected]>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox