Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s5V8Q-005h38-73 for pgsql-general@arkaria.postgresql.org; Fri, 10 May 2024 18:38:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s5V8N-000P12-2f for pgsql-general@arkaria.postgresql.org; Fri, 10 May 2024 18:38:19 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s5V8M-000P0b-JX for pgsql-general@lists.postgresql.org; Fri, 10 May 2024 18:38:19 +0000 Received: from mout.gmx.net ([212.227.17.22]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s5V8J-000NJI-MZ for pgsql-general@lists.postgresql.org; Fri, 10 May 2024 18:38:17 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmx.net; s=s31663417; t=1715366292; x=1715971092; i=jimis@gmx.net; bh=lOJy2V767AWaaoPDiaO8NH8C8I0BCkZ1Bdj5Tb+L4pY=; h=X-UI-Sender-Class:Date:From:To:Subject:Message-ID:MIME-Version: Content-Type:Content-Transfer-Encoding:cc: content-transfer-encoding:content-type:date:from:message-id: mime-version:reply-to:subject:to; b=YtREimIadjsvF/bH8rPdKnS0lGJxJ47PbprV7aRqsiJXdKaFzjkYPgNzgrSMaIHL GRipth7QqabQPrUIh7rUc3IFhQRcP3a/VEKkf/RuLp3Z3q0TPBvnnoGLxTTK5uyim n0BkoxZ31qOKOx9fydL25iYO+qTztybrJXYp2c7D+lDfAEyTtSZspk2T5kwUXX42o sr7/vG+GN80O6ORVTUnqzMq8yxDoWqrz9KrAxTWv3a/lJYutrcUj7XV2vUc1/di4F JPb5fr78MCS86ces6WPVO/VvcyEQc2ujm2thoWmB3v8/8jBzf2PDJubNnvkA+o7fy OMLXxEFW1Bea9j1a9w== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from [10.9.70.35] ([185.55.106.54]) by mail.gmx.net (mrgmx104 [212.227.17.168]) with ESMTPSA (Nemesis) id 1Mf078-1sXhHC242U-00fiB2 for ; Fri, 10 May 2024 20:38:12 +0200 Date: Fri, 10 May 2024 20:38:07 +0200 (CEST) From: Dimitrios Apostolou To: pgsql-general@lists.postgresql.org Subject: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions Message-ID: <7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net> MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset=US-ASCII X-Provags-ID: V03:K1:r1Q+FfAOYRkCLXtqwhSM/xIEhvL+vwYa2zvNguBf4QaBNahu7i3 5Wz8CaDwqJGR2CJKKUclp42ez2ByoRWX+sWaX4RpRD9fEFLKDEK7GID8XEaqaRZC6AnSVeM gqZgQvw/PUOHpa0MwCMpapXqiOElAqngTVimLpJZJeP2dHmNKiz/6K/FriC5q8UMc1lil9w tNqh+AqWRLAKcRHNZcrbg== X-Spam-Flag: NO UI-OutboundReport: notjunk:1;M01:P0:Rho04Q+9Vqk=;h1TGZHEf3FMmOXwoyeEDniwLqvn EuDw0KlLINmd6J2lTTVUoiXHTtR/Nb3e7hekFeUQv2jl5+Ihe0tvGtXUqkIZHlBJscpSI+2r2 ykF+M7jw6BOrn9CNppYiHWdfW8qpZS00WINHxTNd9FqTcCxNUHgvu6RLud5Lphe9Shec11xjW iA50be5MuE7uyEGBdCc7QQ/muoqtG448ZzUYts+Mvg+eY0k/aUdeDsK/sKvCNL3NvM+vcxtRq QfOl1xHeYaVm68Im0urCdsNVUSklQj7Wae0nh67HLIgUzflDnw4g91Oc0s72Fr+4tfxQW5vhS RBATZCpvSxeTAAtaSV0frIYy6MF+4tov9DDSH1KuCFTF98hIxA9D/3XDMa9BEQO9uwZUaSdkz RWz1BYlrO6CLmyPA4+65rEe8InbW8GhvkRO1oP+HRwHHh9LZ3+34iJgzfkjQYxvfdn4Mk6g8h jUg53V81lMnCY08I6Y/vtP0TAtJ5Ic0gPpc/zI8d3UEIKSsOka1XNgP1i2X1/TQoxTlngs3bp cgwFqhvpTZEo10Jm+he7vL7WA7MQtTdfdwhyQtp1GDv7tslMAJ270hUNUkCYc75vkhGYVy8UQ 6wW2dCTEbB46HQrY3cD3dA8WYnqEzgpu84EvHjYCRwPDzBTM7t1OxwzvhwyQfxHcCVIm47fxS qlxSbkU07n6oSJfZPGc6bOEO5bQfUZL59mM/xoFikH4m6KuNEohCEnreBEbsy44KjI9AsVMHq lCUTTXo2QpaMr+iwW4bxkQPYgjHb8fNReumcsPcqwfXjWchl/PDhbbe3FNQgzOCh6pYcXKXa0 4lXgxRPcsdpDl8x67VSDaRAS+N/neINbcMoe5Obrxo6Y0= Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hello list, INTRO I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema: > \d test_runs_raw Partitioned table "public.test_runs_raw" Column | Type | Collation | Nullable | Default =2D------------------+-----------------------------+-----------+----------= +---------------------------------- run_n | bigint | | not null |= generated by default as identity test_case_n | smallint | | not null | workitem_n | integer | | not null | test_resulttype_n | smallint | | | Partition key: RANGE (workitem_n) Indexes: "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_= n) Each partition is made to keep entries with workitem_n in ranges (0,20k), (20k,40k) and so on (k =3D kilo) up to 20000k. PROBLEM I noticed that the following query is very very slow (too long to wait for it to finish): SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIM= IT 10; What is remarkable, is that in 998 out of 1000 table scans it involves, the planner does not use the index. Instead it chooses a sequential scan. Here is the output from EXPLAIN: Limit (cost=3D853891608.79..853891608.99 rows=3D10 width=3D4) -> Unique (cost=3D853891608.79..853891612.79 rows=3D200 width=3D4) -> Sort (cost=3D853891608.79..853891610.79 rows=3D800 width=3D= 4) Sort Key: test_runs_raw.workitem_n DESC -> Gather (cost=3D853891488.22..853891570.22 rows=3D800 = width=3D4) Workers Planned: 4 -> HashAggregate (cost=3D853890488.22..853890490.2= 2 rows=3D200 width=3D4) Group Key: test_runs_raw.workitem_n -> Parallel Append (cost=3D0.00..813118117.3= 0 rows=3D16308948365 width=3D4) -> Parallel Index Only Scan Backward us= ing test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test= _runs_raw_480 (cost=3D0.57..1597355.10 rows=3D33623320 width=3D4) -> Parallel Index Only Scan Backward us= ing test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k te= st_runs_raw_507 (cost=3D0.57..1210795.63 rows=3D25793672 width=3D4) -> Parallel Seq Scan on test_runs_raw__= part_max9500k test_runs_raw_475 (cost=3D0.00..3037793.12 rows=3D64121612 = width=3D4) -> Parallel Seq Scan on test_runs_raw__= part_max11180k test_runs_raw_559 (cost=3D0.00..2918875.90 rows=3D61612190= width=3D4) [ ... 996 more sequential scans ... ] If I remove DISTINCT then the plan changes dramatically and it runs instantaneously: Limit (cost=3D363.84..367.30 rows=3D10 width=3D4) -> Append (cost=3D363.84..22527480551.58 rows=3D65235793929 width=3D= 4) -> Index Only Scan Backward using test_runs_raw__part_max20000k= _pkey on test_runs_raw__part_max20000k test_runs_raw_1000 (cost=3D0.12..2= .34 rows=3D1 width=3D4) -> Index Only Scan Backward using test_runs_raw__part_max19980k= _pkey on test_runs_raw__part_max19980k test_runs_raw_999 (cost=3D0.12..2.= 34 rows=3D1 width=3D4) -> Index Only Scan Backward using test_runs_raw__part_max19960k= _pkey on test_runs_raw__part_max19960k test_runs_raw_998 (cost=3D0.12..2.= 34 rows=3D1 width=3D4) -> Index Only Scan Backward using test_runs_raw__part_max19940k= _pkey on test_runs_raw__part_max19940k test_runs_raw_997 (cost=3D0.12..2.= 34 rows=3D1 width=3D4) [ ... 996 more index scans ... ] Notice how in the last plan there is no parallel scanning. Instead the partitions are scanned sequentially, *in proper order*, so that the plan execution stops after reading the first 10 rows in the first non-empty partition. Why can't the same be done with DISTINCT? Please note that the workitem_n value range is well spread into in range (0,13M) and the table has been gradually filled within one year, so I'm assuming the vacuum worker has worked long enough to build sane statistics (not sure how to verify that). REMARKS 1. I tried reproducing the problem on an artificial table with few partitions and few values, but I couldn't. Both queries execute fast, and the planner is always choosing a non-parallel index-only scan. 2. Among testing changes to various settings, I just noticed that setting max_parallel_workers_per_gather to 0 (from the original value of 4) fixes the issue! On the original huge table, disabling parallelism actually makes the query infinitely faster and it returns within 1s! I= s this a bug in the planner? Thank you, Dimitris