Date: Fri, 10 May 2024 20:38:07 +0200 (CEST)
From: Dimitrios Apostolou <jimis@gmx.net>
To: pgsql-general@lists.postgresql.org
Subject: SELECT DISTINCT chooses parallel seqscan instead of indexscan on
 huge table with 1000 partitions
Message-ID: <7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net>
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset=US-ASCII
UI-OutboundReport: notjunk:1;M01:P0:Rho04Q+9Vqk=;h1TGZHEf3FMmOXwoyeEDniwLqvn
 EuDw0KlLINmd6J2lTTVUoiXHTtR/Nb3e7hekFeUQv2jl5+Ihe0tvGtXUqkIZHlBJscpSI+2r2
 ykF+M7jw6BOrn9CNppYiHWdfW8qpZS00WINHxTNd9FqTcCxNUHgvu6RLud5Lphe9Shec11xjW
 iA50be5MuE7uyEGBdCc7QQ/muoqtG448ZzUYts+Mvg+eY0k/aUdeDsK/sKvCNL3NvM+vcxtRq
 QfOl1xHeYaVm68Im0urCdsNVUSklQj7Wae0nh67HLIgUzflDnw4g91Oc0s72Fr+4tfxQW5vhS
 RBATZCpvSxeTAAtaSV0frIYy6MF+4tov9DDSH1KuCFTF98hIxA9D/3XDMa9BEQO9uwZUaSdkz
 RWz1BYlrO6CLmyPA4+65rEe8InbW8GhvkRO1oP+HRwHHh9LZ3+34iJgzfkjQYxvfdn4Mk6g8h
 jUg53V81lMnCY08I6Y/vtP0TAtJ5Ic0gPpc/zI8d3UEIKSsOka1XNgP1i2X1/TQoxTlngs3bp
 cgwFqhvpTZEo10Jm+he7vL7WA7MQtTdfdwhyQtp1GDv7tslMAJ270hUNUkCYc75vkhGYVy8UQ
 6wW2dCTEbB46HQrY3cD3dA8WYnqEzgpu84EvHjYCRwPDzBTM7t1OxwzvhwyQfxHcCVIm47fxS
 qlxSbkU07n6oSJfZPGc6bOEO5bQfUZL59mM/xoFikH4m6KuNEohCEnreBEbsy44KjI9AsVMHq
 lCUTTXo2QpaMr+iwW4bxkQPYgjHb8fNReumcsPcqwfXjWchl/PDhbbe3FNQgzOCh6pYcXKXa0
 4lXgxRPcsdpDl8x67VSDaRAS+N/neINbcMoe5Obrxo6Y0=
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/7886a68f-b466-2131-1747-f69f0fb71a37%40gmx.net>
Precedence: bulk

Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:

> \d test_runs_raw
                                  Partitioned table "public.test_runs_raw"
       Column       |            Type             | Collation | Nullable |
Default
=2D------------------+-----------------------------+-----------+----------=
+----------------------------------
  run_n             | bigint                      |           | not null |=
 generated by default as identity
  test_case_n       | smallint                    |           | not null |
  workitem_n        | integer                     |           | not null |
  test_resulttype_n | smallint                    |           |          |
Partition key: RANGE (workitem_n)
Indexes:
     "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_=
n)

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k =3D kilo) up to 20000k.


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIM=
IT 10;

What is remarkable, is that in 998 out of 1000 table scans it involves,
the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

  Limit  (cost=3D853891608.79..853891608.99 rows=3D10 width=3D4)
    ->  Unique  (cost=3D853891608.79..853891612.79 rows=3D200 width=3D4)
          ->  Sort  (cost=3D853891608.79..853891610.79 rows=3D800 width=3D=
4)               Sort Key: test_runs_raw.workitem_n DESC
                ->  Gather  (cost=3D853891488.22..853891570.22 rows=3D800 =
width=3D4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=3D853890488.22..853890490.2=
2 rows=3D200 width=3D4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=3D0.00..813118117.3=
0 rows=3D16308948365 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test=
_runs_raw_480  (cost=3D0.57..1597355.10 rows=3D33623320 width=3D4)
                                  ->  Parallel Index Only Scan Backward us=
ing test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k te=
st_runs_raw_507  (cost=3D0.57..1210795.63 rows=3D25793672 width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max9500k test_runs_raw_475  (cost=3D0.00..3037793.12 rows=3D64121612 =
width=3D4)
                                  ->  Parallel Seq Scan on test_runs_raw__=
part_max11180k test_runs_raw_559  (cost=3D0.00..2918875.90 rows=3D61612190=
 width=3D4)
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

  Limit  (cost=3D363.84..367.30 rows=3D10 width=3D4)
    ->  Append  (cost=3D363.84..22527480551.58 rows=3D65235793929 width=3D=
4)
          ->  Index Only Scan Backward using test_runs_raw__part_max20000k=
_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=3D0.12..2=
.34 rows=3D1 width=3D4)
          ->  Index Only Scan Backward using test_runs_raw__part_max19980k=
_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=3D0.12..2.=
34 rows=3D1 width=3D4)
          ->  Index Only Scan Backward using test_runs_raw__part_max19960k=
_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=3D0.12..2.=
34 rows=3D1 width=3D4)
          ->  Index Only Scan Backward using test_runs_raw__part_max19940k=
_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=3D0.12..2.=
34 rows=3D1 width=3D4)
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n value range is well spread into in range
(0,13M) and the table has been gradually filled within one year, so I'm
assuming the vacuum worker has worked long enough to build sane statistics
(not sure how to verify that).


REMARKS

1. I tried reproducing the problem on an artificial table with few
    partitions and few values, but I couldn't. Both queries execute fast,
    and the planner is always choosing a non-parallel index-only scan.

2. Among testing changes to various settings, I just noticed that setting
    max_parallel_workers_per_gather to 0 (from the original value of 4)
    fixes the issue! On the original huge table, disabling parallelism
    actually makes the query infinitely faster and it returns within 1s! I=
s
    this a bug in the planner?


Thank you,
Dimitris