Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Dimitrios Apostolou <[email protected]>
To: David Rowley <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: [email protected]
Subject: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Date: Mon, 13 May 2024 14:41:05 +0200 (CEST)
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAApHDvrtTKfh7HgAyXBd3KN0s-jxiHzW7sWdm-sFEjP6fGPCkg@mail.gmail.com>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAApHDvrtTKfh7HgAyXBd3KN0s-jxiHzW7sWdm-sFEjP6fGPCkg@mail.gmail.com>

On Sat, 11 May 2024, David Rowley wrote:

> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <[email protected]> wrote:
>> Indeed that's an awful estimate, the table has more than 1M of unique
>> values in that column. Looking into pg_stat_user_tables, I can't see the
>> partitions having been vacuum'd or analyzed at all. I think they should
>> have been auto-analyzed, since they get a ton of INSERTs
>> (no deletes/updates though) and I have the default autovacuum settings.
>> Could it be that autovacuum starts, but never
>> finishes? I can't find something in the logs.
>
> It's not the partitions getting analyzed you need to worry about for
> an ndistinct estimate on the partitioned table. It's auto-analyze or
> ANALYZE on the partitioned table itself that you should care about.
>
> If you look at [1], it says "Tuples changed in partitions and
> inheritance children do not trigger analyze on the parent table."

Thanks

>
>> In any case, even after the planner decides to execute the terrible plan
>> with the parallel seqscans, why doesn't it finish right when it finds 10
>> distinct values?
>
> It will. It's just that Sorting requires fetching everything from its subnode.

Isn't it plain wrong to have a sort step in the plan than? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.

Even with bad estimates, even with seq scan instead of index scan, the
plan should be such that it concludes all parallel work as soon as it
finds the 10 distinct values. And this is actually achieved if I disable
parallel plans. Could it be a bug in the parallel plan generation?


Dimitris

view thread (17+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox