Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
5+ messages / 2 participants
[nested] [flat]

* Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
@ 2024-05-13 13:32 David Rowley <[email protected]>
  2024-05-13 13:52 ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Dimitrios Apostolou <[email protected]>
  2024-05-14 18:11 ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Dimitrios Apostolou <[email protected]>
  0 siblings, 2 replies; 5+ messages in thread

From: David Rowley @ 2024-05-13 13:32 UTC (permalink / raw)
  To: Dimitrios Apostolou <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <[email protected]> wrote:
>
> On Sat, 11 May 2024, David Rowley wrote:
> > It will. It's just that Sorting requires fetching everything from its subnode.
>
> Isn't it plain wrong to have a sort step in the plan than? The different
> partitions contain different value ranges with no overlap, and the last
> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
> clause on an indexed column.

The query does contain an ORDER BY, so if the index is not chosen to
provide pre-sorted input, then something has to put the results in the
correct order before the LIMIT is applied.

> Even with bad estimates, even with seq scan instead of index scan, the
> plan should be such that it concludes all parallel work as soon as it
> finds the 10 distinct values. And this is actually achieved if I disable
> parallel plans. Could it be a bug in the parallel plan generation?

If you were to put the n_distinct_inherited estimate back to 200 and
disable sort, you should see the costs are higher for the index plan.
If that's not the case then there might be a bug.  It seems more
likely that due to the n_distinct estimate being so low that the
planner thought that a large enough fraction of the rows needed to be
read and that made the non-index plan appear cheaper.

I'd be interested in seeing what the costs are for the index plan. I
think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set
(n_distinct_inherited=200);
analyze test_runs_raw;
set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY
workitem_n DESC LIMIT 10;

-- undo
alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);
reset enable_sort;

David






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
  2024-05-13 13:32 Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
@ 2024-05-13 13:52 ` Dimitrios Apostolou <[email protected]>
  2024-05-13 14:15   ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
  1 sibling, 1 reply; 5+ messages in thread

From: Dimitrios Apostolou @ 2024-05-13 13:52 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

On Tue, 14 May 2024, David Rowley wrote:

> On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <[email protected]> wrote:
>>
>> On Sat, 11 May 2024, David Rowley wrote:
>>> It will. It's just that Sorting requires fetching everything from its subnode.
>>
>> Isn't it plain wrong to have a sort step in the plan than? The different
>> partitions contain different value ranges with no overlap, and the last
>> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
>> clause on an indexed column.
>
> The query does contain an ORDER BY, so if the index is not chosen to
> provide pre-sorted input, then something has to put the results in the
> correct order before the LIMIT is applied.

The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back with further debugging info you asked
for later today.


Dimitris







^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
  2024-05-13 13:32 Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
  2024-05-13 13:52 ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Dimitrios Apostolou <[email protected]>
@ 2024-05-13 14:15   ` David Rowley <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: David Rowley @ 2024-05-13 14:15 UTC (permalink / raw)
  To: Dimitrios Apostolou <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

On Tue, 14 May 2024 at 01:52, Dimitrios Apostolou <[email protected]> wrote:
>
> On Tue, 14 May 2024, David Rowley wrote:
> > The query does contain an ORDER BY, so if the index is not chosen to
> > provide pre-sorted input, then something has to put the results in the
> > correct order before the LIMIT is applied.
>
> The last query I tried was:
>
> SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

I was looking at the original query.   In that case, we have 2 ways to
remove duplicate rows with DISTINCT, "Hash Aggregate" and "Sort" ->
"Unique". Both of these will consume all of their input rows before
outputting any rows.

DISTINCT with LIMIT is a special case that we don't have a good
operator for.  In theory, we could have some "Hash Distinct" node type
that was less eager to consume all of its input rows.  When invoked
"Hash Distinct" could consume input rows until it found one that
didn't exist in the hash table.  I've no idea how that would work when
we exceed work_mem.  However, most queries with a LIMIT will have an
ORDER BY, so such a node likely wouldn't get much use.

David

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
  2024-05-13 13:32 Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
@ 2024-05-14 18:11 ` Dimitrios Apostolou <[email protected]>
  2024-05-14 18:26   ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Dimitrios Apostolou <[email protected]>
  1 sibling, 1 reply; 5+ messages in thread

From: Dimitrios Apostolou @ 2024-05-14 18:11 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

On Tue, 14 May 2024, David Rowley wrote:
>
> If you were to put the n_distinct_inherited estimate back to 200 and
> disable sort, you should see the costs are higher for the index plan.
> If that's not the case then there might be a bug.  It seems more
> likely that due to the n_distinct estimate being so low that the
> planner thought that a large enough fraction of the rows needed to be
> read and that made the non-index plan appear cheaper.
>
> I'd be interested in seeing what the costs are for the index plan. I
> think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set
> (n_distinct_inherited=200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all
partitions again. But it seems it had the desired effect.

> set enable_sort=0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

It chooses the non-parallel index plan:

  Limit  (cost=365.17..1135517462.36 rows=10 width=4)
    ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
          ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
[... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, that even contains a sort!
But it seams to me that the extra upper level HashAggregate step raises
the cost by an order of magnitude, from 800M to 10G, in comparison to
doing (Unique->Sort) - see plan in the next paragraph.

  Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
    ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
          Sort Key: test_runs_raw.workitem_n DESC
          ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
                Group Key: test_runs_raw.workitem_n
                ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=857219295.12..857219297.12 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..816295259.21 rows=16369614363 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
[... only seqscans follow]



If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN  SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

  Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
    ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
          ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
                Sort Key: test_runs_raw.workitem_n DESC
                ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=857165135.82..857165137.82 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..816243567.24 rows=16368627432 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
[... only seqscans follow]

So in order of higher to lower cost, we have the following alternatives:

1. non-parallel index scan  (800M)
2. parallel seqscan with sort  (1.3G)
3. parallel seqscan without sort but actually has a sort  (10G assuming it's the same as for LIMIT 100)

>
> -- undo
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);

I believe I need to set it to 0 to be back to defaults.

> reset enable_sort;
>


Regards,
Dimitris







^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
  2024-05-13 13:32 Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
  2024-05-14 18:11 ` Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Dimitrios Apostolou <[email protected]>
@ 2024-05-14 18:26   ` Dimitrios Apostolou <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Dimitrios Apostolou @ 2024-05-14 18:26 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]

I have forgotten to mention that I have enable_partitionwise_aggregate=on
in the global settings since the beginning. According to the docs:

> Enables or disables the query planner's use of partitionwise grouping or
> aggregation, which allows grouping or aggregation on partitioned tables
> to be performed separately for each partition.

Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every
partition, and then it would be up to independent plans to decide whether
each partition follows a parallel or a serial plan.

Not sure if this plan was checked but rejected because of cost.


Dimitris







^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2024-05-14 18:26 UTC | newest]

Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-05-13 13:32 Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions David Rowley <[email protected]>
2024-05-13 13:52 ` Dimitrios Apostolou <[email protected]>
2024-05-13 14:15   ` David Rowley <[email protected]>
2024-05-14 18:11 ` Dimitrios Apostolou <[email protected]>
2024-05-14 18:26   ` Dimitrios Apostolou <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox