Date: Tue, 14 May 2024 20:11:26 +0200 (CEST)
From: Dimitrios Apostolou <jimis@gmx.net>
To: David Rowley <dgrowleyml@gmail.com>
cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-general@lists.postgresql.org
Subject: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan
 on huge table with 1000 partitions
In-Reply-To: <CAApHDvo8yYvqa1+bkW_f5xHX-gmKGYfaGwH+Y_KP-=9TOuF+-g@mail.gmail.com>
Message-ID: <46f9dc8a-ff83-d086-f726-9df0a0d39c97@gmx.net>
References: <7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net> <69077f15-4125-2d63-733f-21ce6eac4f01@gmx.net> <559b0e40-63e6-fa9a-6b03-d1eba10f30f8@gmx.net> <1629463.1715372568@sss.pgh.pa.us> <6a2f3906-3d7a-6924-7403-8f77d57a18e4@gmx.net>
 <CAApHDvrtTKfh7HgAyXBd3KN0s-jxiHzW7sWdm-sFEjP6fGPCkg@mail.gmail.com> <3efce60b-48c0-160d-0444-474b02f76739@gmx.net> <CAApHDvo8yYvqa1+bkW_f5xHX-gmKGYfaGwH+Y_KP-=9TOuF+-g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
UI-OutboundReport: notjunk:1;M01:P0:iKgZ/aZU1T0=;Bx4OgOki1tFNXHmkKliSdj8yk3p
 bRACdGpF6RTCS1Ba/9+i0Rb5d8ukO1l7qH0xD4FjV8Lc1NP8IeeG8ujEElObCLb2a6Hturkgy
 EGO3Y/u69N1/SWOmHDje8SzRJiSChZRg+HDQr0OGejvuCFYU/KgbLTxZEz6SLPTPsBdATyd3B
 QpWO5+X/RubtNbVpLEzl09MMb4rvGYT9CyOWse57erF+Hg32EQdwunKHKwoLMPTZu052awL7V
 jDjIrHoIdZX5ECJtaLztvp7OP3O7AaYfgVCK5YGcZkDznsA5EAWP6x8YFqjwozmfpfH+XIOQo
 2BSX8fO0SyipIK0/13mEnaA3hwsos5z5ZrQoB2af9K0wxmiMz4fRFXlJCWYEkp5v5xvNps8Mk
 JuFag90PEUnY7iIb2zl5fHq4xAqokSWBwvfyVZjhw5vxX70D3LE5J5FHI3VYEbS3zGjNjPLQL
 kzqcwJouSfX/81vsRMEa5PZtaHwUuRDmGOth6RolOKybwx8USYejMcUHbRLGCtd/sIA1J7ztd
 3wLwStEsoPvxbK5k9LiNPFM3MjWd8ZguMAU2pAcB4BDnXbzbU69rua3F8dimegxKIKYLby3Tn
 X305DcRSCQcM14vZ+gcItElVrx+C5jXD022tboRsERG3TD74bOZNawVFfGa6doY8eujUAd62I
 JpbkOMjIUeN0se63PI0d5EYuy37CFCM2rw7ssIYNVypbvhxn8DWIu23gMhX8PUSkPh8tQYcqk
 kxYNuXb62KJ6hs49qSTRK86gQCkPpUWiQgDJ4LEWivJh5xeSC1Xpp/Q9mRkRRuMEpi0Kl2etb
 zVjKN5NgiXdyDz1TB0sdxEfhKVo+aJKdSPD4tOBZmrJjE=
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/46f9dc8a-ff83-d086-f726-9df0a0d39c97%40gmx.net>
Precedence: bulk

On Tue, 14 May 2024, David Rowley wrote:
>
> If you were to put the n_distinct_inherited estimate back to 200 and
> disable sort, you should see the costs are higher for the index plan.
> If that's not the case then there might be a bug.  It seems more
> likely that due to the n_distinct estimate being so low that the
> planner thought that a large enough fraction of the rows needed to be
> read and that made the non-index plan appear cheaper.
>
> I'd be interested in seeing what the costs are for the index plan. I
> think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set
> (n_distinct_inherited=3D200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all
partitions again. But it seems it had the desired effect.

> set enable_sort=3D0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_=
n DESC LIMIT 10;

It chooses the non-parallel index plan:

  Limit  (cost=3D365.17..1135517462.36 rows=3D10 width=3D4)
    ->  Unique  (cost=3D365.17..22710342308.83 rows=3D200 width=3D4)
          ->  Append  (cost=3D365.17..22546660655.46 rows=3D65472661350 wi=
dth=3D4)
                ->  Index Only Scan Backward using test_runs_raw__part_max=
20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=3D0=
.12..2.34 rows=3D1 width=3D4)
                ->  Index Only Scan Backward using test_runs_raw__part_max=
19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=3D0.=
12..2.34 rows=3D1 width=3D4)
[... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, that even contains a sort!
But it seams to me that the extra upper level HashAggregate step raises
the cost by an order of magnitude, from 800M to 10G, in comparison to
doing (Unique->Sort) - see plan in the next paragraph.

  Limit  (cost=3D10857220388.76..10857220389.01 rows=3D100 width=3D4)
    ->  Sort  (cost=3D10857220388.76..10857220389.26 rows=3D200 width=3D4)
          Sort Key: test_runs_raw.workitem_n DESC
          ->  HashAggregate  (cost=3D857220379.12..857220381.12 rows=3D200=
 width=3D4)
                Group Key: test_runs_raw.workitem_n
                ->  Gather  (cost=3D857220295.12..857220377.12 rows=3D800 =
width=3D4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=3D857219295.12..857219297.1=
2 rows=3D200 width=3D4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=3D0.00..816295259.2=
1 rows=3D16369614363 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test=
_runs_raw_480  (cost=3D0.57..1597356.30 rows=3D33623360 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k te=
st_runs_raw_507  (cost=3D0.57..1210806.37 rows=3D25794030 width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max9500k test_runs_raw_475  (cost=3D0.00..3037800.88 rows=3D64122388 =
width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max11180k test_runs_raw_559  (cost=3D0.00..2918865.36 rows=3D61611136=
 width=3D4)
[... only seqscans follow]


If I re-enable sort, then it goes for the parallel seqscan plan even with =
LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN  SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n=
 DESC LIMIT 10;

  Limit  (cost=3D857166256.39..857166256.59 rows=3D10 width=3D4)
    ->  Unique  (cost=3D857166256.39..857166260.39 rows=3D200 width=3D4)
          ->  Sort  (cost=3D857166256.39..857166258.39 rows=3D800 width=3D=
4)
                Sort Key: test_runs_raw.workitem_n DESC
                ->  Gather  (cost=3D857166135.82..857166217.82 rows=3D800 =
width=3D4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=3D857165135.82..857165137.8=
2 rows=3D200 width=3D4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=3D0.00..816243567.2=
4 rows=3D16368627432 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test=
_runs_raw_480  (cost=3D0.57..1597356.30 rows=3D33623360 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k te=
st_runs_raw_507  (cost=3D0.57..1210806.37 rows=3D25794030 width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max9500k test_runs_raw_475  (cost=3D0.00..3037800.88 rows=3D64122388 =
width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max11180k test_runs_raw_559  (cost=3D0.00..2918865.36 rows=3D61611136=
 width=3D4)
[... only seqscans follow]

So in order of higher to lower cost, we have the following alternatives:

1. non-parallel index scan  (800M)
2. parallel seqscan with sort  (1.3G)
3. parallel seqscan without sort but actually has a sort  (10G assuming it=
's the same as for LIMIT 100)

>
> -- undo
> alter table test_runs_raw alter column workitem_n set (n_distinct_inheri=
ted=3D-1);

I believe I need to set it to 0 to be back to defaults.

> reset enable_sort;
>


Regards,
Dimitris