Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

From: David Rowley <[email protected]>
To: Dimitrios Apostolou <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: [email protected]
Subject: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Date: Tue, 14 May 2024 01:32:02 +1200
Message-ID: <CAApHDvo8yYvqa1+bkW_f5xHX-gmKGYfaGwH+Y_KP-=9TOuF+-g@mail.gmail.com>
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAApHDvrtTKfh7HgAyXBd3KN0s-jxiHzW7sWdm-sFEjP6fGPCkg@mail.gmail.com>
	<[email protected]>

On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <[email protected]> wrote:
>
> On Sat, 11 May 2024, David Rowley wrote:
> > It will. It's just that Sorting requires fetching everything from its subnode.
>
> Isn't it plain wrong to have a sort step in the plan then? The different
> partitions contain different value ranges with no overlap, and the last
> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
> clause on an indexed column.

The query does contain an ORDER BY, so if the index is not chosen to
provide pre-sorted input, then something has to put the results in the
correct order before the LIMIT is applied.

> Even with bad estimates, even with seq scan instead of index scan, the
> plan should be such that it concludes all parallel work as soon as it
> finds the 10 distinct values. And this is actually achieved if I disable
> parallel plans. Could it be a bug in the parallel plan generation?

If you were to put the n_distinct_inherited estimate back to 200 and
disable sort, you should see the costs are higher for the index plan.
If that's not the case then there might be a bug.  It seems more
likely that, because the n_distinct estimate was so low, the
planner thought that a large enough fraction of the rows needed to be
read, and that made the non-index plan appear cheaper.
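
To check which estimate the planner is currently working with, something
like the following may help. This is only a sketch: it assumes
test_runs_raw resolves through the current search_path, and it uses the
standard pg_stats view and pg_attribute catalog.

```sql
-- Statistics gathered by ANALYZE. The row with inherited = true covers the
-- partitioned parent together with all of its partitions.
-- n_distinct > 0 is an absolute count of distinct values; n_distinct < 0 is
-- a multiplier on the row count (-1 means every row is treated as distinct).
select inherited, n_distinct
from pg_stats
where tablename = 'test_runs_raw'
  and attname = 'workitem_n';

-- Any per-column override set with ALTER TABLE ... SET (n_distinct_inherited=...)
-- is stored in attoptions; NULL means no override is in place.
select attoptions
from pg_attribute
where attrelid = 'test_runs_raw'::regclass
  and attname = 'workitem_n';
```

An override in attoptions only reaches pg_stats once ANALYZE has been run
again, so the two results can disagree until then.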

I'd be interested in seeing what the costs are for the index plan. I
think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200);
analyze test_runs_raw;
set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

-- undo (the restored n_distinct_inherited only takes effect at the next ANALYZE)
alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);
analyze test_runs_raw;
reset enable_sort;

David
