Date: Tue, 14 May 2024 14:55:26 +0200 (CEST)
From: Dimitrios Apostolou <jimis@gmx.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: pgsql-general@lists.postgresql.org, David Rowley <dgrowleyml@gmail.com>
Subject: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan
 on huge table with 1000 partitions
In-Reply-To: <1629463.1715372568@sss.pgh.pa.us>
Message-ID: <660a8477-4130-40da-3492-f8827c5c3596@gmx.net>
References: <7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net> <69077f15-4125-2d63-733f-21ce6eac4f01@gmx.net> <559b0e40-63e6-fa9a-6b03-d1eba10f30f8@gmx.net> <1629463.1715372568@sss.pgh.pa.us>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/660a8477-4130-40da-3492-f8827c5c3596%40gmx.net>
Precedence: bulk


On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>>   Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>     ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>           ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows.  That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.
>
> Possibly an explicit ANALYZE on the partitioned table would help.

It took a long time, but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed, the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;
  Limit  (cost=364.29..1835512.29 rows=10 width=4)
    ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
          ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
[...]
                ->  Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
[...]
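
As a sanity check on the explanation above, the Limit cost in both plans
is consistent with a simple linear interpolation: startup cost plus the
fraction (LIMIT / estimated distinct rows) of the Unique node's run cost.
The sketch below only redoes that arithmetic with the numbers from the two
EXPLAIN outputs; the function name is made up for illustration and is not
a PostgreSQL API.

```python
def limit_total_cost(startup, total, limit_rows, input_rows):
    """Interpolate the cost of fetching only the first limit_rows rows.

    Assumes the input node's cost grows linearly with the rows it returns.
    """
    fraction = limit_rows / input_rows
    return startup + (total - startup) * fraction


# Unique node before ANALYZE: rows=200 (default-looking n_distinct estimate)
before = limit_total_cost(363.84, 22690570036.41, 10, 200)

# Unique node after ANALYZE: rows=123706
after = limit_total_cost(364.29, 22701882164.56, 10, 123706)

print(f"before ANALYZE: {before:.2f}")  # 1134528847.47, as in the first plan
print(f"after ANALYZE:  {after:.2f}")   # 1835512.29, as in the second plan
```

With rows=200 the planner expects to pay for 5% of the whole scan, and
with rows=123706 only about 0.008%, which is why the fast-start serial
plan wins after ANALYZE.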

Overall I think there are two issues that postgres could handle better
here:

1. Avoid the need for a manual ANALYZE on a partitioned table

2. Create a different parallel plan, one that can exit early, when the
    LIMIT is proportionally low. I feel the partitions could be
    parallel-scanned in-order, so that the whole thing stops when one
    partition has been read.
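
To make point 2 more concrete, here is a toy model of the early exit I
have in mind. It is plain Python, not PostgreSQL executor code, and it is
serial: a parallel variant would split the work inside the current
partition among workers. It assumes the partitions are range-partitioned
on the key and visited from the highest range to the lowest, each read in
descending key order (as the backward index scans above do). All names
and the sample data are hypothetical.

```python
def distinct_desc_with_limit(partitions, limit):
    """Return the first `limit` distinct values and the number of rows read.

    `partitions` is ordered from the highest key range to the lowest, and
    each partition yields its values in descending order, so duplicates
    are always adjacent and the scan can stop as soon as `limit` is hit.
    """
    result = []
    rows_read = 0
    for part in partitions:
        for value in part:
            rows_read += 1
            if not result or result[-1] != value:
                result.append(value)
                if len(result) == limit:
                    return result, rows_read  # early exit
    return result, rows_read


# Two empty top partitions, then data (made-up values).
partitions = [[], [], [9, 9, 9, 8, 8, 7], [6, 6, 5, 5, 5, 4], [3, 2, 1]]
result, rows_read = distinct_desc_with_limit(partitions, limit=3)
total_rows = sum(len(p) for p in partitions)
print(result, rows_read, total_rows)  # [9, 8, 7] 6 15
```

Only the first non-empty partition is touched here, while a plan that
scans everything and then deduplicates would read all the rows.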

Thank you!
Dimitris