Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wC7xu-001gjs-04 for pgsql-hackers@arkaria.postgresql.org; Mon, 13 Apr 2026 03:27:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wC7xr-004Lav-0f for pgsql-hackers@arkaria.postgresql.org; Mon, 13 Apr 2026 03:27:56 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wC7xq-004Lan-2q for pgsql-hackers@lists.postgresql.org; Mon, 13 Apr 2026 03:27:55 +0000 Received: from mail-ej1-x62c.google.com ([2a00:1450:4864:20::62c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wC7xp-00000000lqB-1y84 for pgsql-hackers@lists.postgresql.org; Mon, 13 Apr 2026 03:27:55 +0000 Received: by mail-ej1-x62c.google.com with SMTP id a640c23a62f3a-b9825ba7f9dso564708766b.0 for ; Sun, 12 Apr 2026 20:27:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1776050872; cv=none; d=google.com; s=arc-20240605; b=XbFJJ5vfzYRglM5zsVmaV3vIsvYS2NxLZqUeiLoi2UBvKAjMBmnDV8enqDIWY70Okm pWtamXt4SE7cbXQTy/a4MnC/RRk7qBt2OSB2ItOmC+qQtuWFp493xR/VXbVILzMtcKXX b/V7aE8hyjj/BdNFFOpLvcMiX6pqJ22UdktobM8vvGXI66ZC4U0UyHr9o5SLLaZMtJEp opFKxhhX0o6FF559i/aRpMtDsA0Y7sdHR7XTWvvFbcvzLrjmtCllEO75c40ZbQWW6M+9 RcbhCptcIeM4e+MB4EllM/Brr46yM8D2+L2IDUoIJA9nyt0nY0WpNHEi0N+SvCTKZElE 5ycA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=QUBboQYQv5On4UayHW5mvbk0QozVR55a3tdtA0h5aro=; fh=NOo30AAV3JZ2gV+RE/nn4WjuTPJNrVMgF4c+LPp6+pw=; b=ZJt18WqcmxfTd0ghatKUWHO7EnUsgUAaGR+g8fFQiYt83VGEg+djG1wVbp0sRA7b/E r3zbu/uFvxzeu6fL5cp2nV/lGK53pSE40jB1iNGpd9uiCwZTckMceDG2RCkN1QXpNTm+ R20YIqSChSa9PTcJY8PXytSKfND1ktS894X3GOrZ1h35QcXtNC6owfxpi9HqbLSnzcoj XTQn4yBhyI9cP+Tvfv5iuTy376u7NS/fG0i9lrkRP1c3FoWF92NrFKZbePP8KG4VVy/F Xbg42APycmOsqUcxy9fFvCGxMwkq6H1m6YZm7nk9AXfDj+u4OS7BoGoiIq6phs7pzk0I 3Dfw==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1776050872; x=1776655672; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=QUBboQYQv5On4UayHW5mvbk0QozVR55a3tdtA0h5aro=; b=a/pzUnxR5RKnllJWtv2r+ogguBdRSVqn+q3XoLSveopvNM1c6LBD32SZ3Z+th5ju0O MY6f7mjqKY0FatTyv+va7bY4gCilTV9IgJm8T1wSpfmkcQS6OmeRoGKNZ1OBc3hqAUzg E+hO5+df2kY9XrnkSGMyZJ0mxeK2e5C77kddDwuXocaeytH210ONVTp0ls4BFylVtzWa 98ghMxQJn6dEKmOg2fLKoOEuTDL3vSAi6to3AcRwKBx1XdEd5J3QF3yL/JL1qXfMPqBW 3R3za/bOdlkDWxSEA8lG/A4Y5VmWmOak968IUscf7oiRhxZgDaWNDOpEb44UFvnmM/JR 2AKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1776050872; x=1776655672; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=QUBboQYQv5On4UayHW5mvbk0QozVR55a3tdtA0h5aro=; b=dQ76DlLwVZy2PaYAsy047CdsQSyAjokn7gbZWZIp1Db7ndeDFu4o3AcsS1GxYp+CSn 76nyVwypeIip353pjvr83Ci8RNZgzCefgA3J5rv2Pc6HmElePBbpP3fr6iXTdkWduZTd W6TbgsJ4OUqe8mdiAzyFcRL34fHdqfoT4lIJM1AR5NCWXR7Wel7F713kz1AjA1E+Lfph 8MCgrYllmqzQB1c68Sex4HlOX6VLny4LsMqGGk3FWhfE6RK0YDStJESLIkSpdlnbmQIk 8OUyOfpvk5DJFMD2yiisdXqTjVAHvSoj2ajSn8igvYYj2fOXjUm20BnenvTrhn0xklSb AVOA== X-Gm-Message-State: AOJu0YxEfoD8o4N/dy0G6FqgvIBrzBsMcT2qjPEaN1695VbjsmL+UKQf 00YcJ46k6chNc+vp75w8l5J9/eu5INtq3wZcX7+wxt6dFPFL1VonGne56Ey8hq48UmuOfK1wooX mFt6LZf6e5o0Bct0dwDVt9EOxZKxGzVA= X-Gm-Gg: AeBDietYqDVm9XZofHRzYNOOlO257nhxJoS0MaJ0s6cyimEa+UDpO96wCvvMC2999wD 9LAAqnMl07b8S0MvQ9Myd7sxeZ4c3Lkve/tOsX6aqj0aTe/fAT+RKmYPW6fQQI+oGTuLe/UiHdK VTyZP8Oz13OW8OW0hh3Ju+tFy8htZGC7MbLAxNO7dk0q1CJ4+MH2/YO9KMmhpvy9BZ8ySvT03Mz bcoKNk6zplHr3C6REDf+M+lgVqyA32CpaT0vsE3+K7/4P6Gt92SrwUkTYKX+LD9fxgDJp0zQf7X wA0t/qmvlqpMdO0/wX5jDQNcCeoCz9ASLrPiba54VZviDBUK X-Received: by 2002:a17:907:2d29:b0:b9c:b178:36c3 with SMTP id a640c23a62f3a-b9d7684d836mr569008466b.2.1776050871845; Sun, 12 Apr 2026 20:27:51 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: wenhui qiu Date: Mon, 13 Apr 2026 11:27:39 +0800 X-Gm-Features: AQROBzCM9DXfbYiSz15zifhrmyQn9kOJts-17gTaJ3ux9yQyCspPbBQ4HIZFRuQ Message-ID: Subject: Re: Propagate stadistinct through GROUP BY/DISTINCT in subqueries and CTEs To: Richard Guo Cc: Pg Hackers Content-Type: multipart/alternative; boundary="00000000000051c7e1064f4f11ca" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000051c7e1064f4f11ca Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable HI Richard > + /* Convert negative stadistinct to absolute count */ > + if (stats->stadistinct < 0) > + { > + RelOptInfo *baserel =3D find_base_rel(subroot, var->varno); > + > + if (baserel->tuples > 0) > + { > + stats->stadistinct =3D (float4) > + clamp_row_est(-stats->stadistinct * baserel->tuples); > + } > + } Thanks so much for working on this! While looking at the negative stadistinct conversion, I was wondering if we might run into a potential edge case with multi-level nested subqueries. What do you think? /* Convert negative stadistinct to absolute count */ if (stats->stadistinct < 0) { - RelOptInfo *baserel =3D find_base_rel(subroot, var->varno); + RelOptInfo *baserel =3D vardata->rel; - if (baserel->tuples > 0) + if (baserel && baserel->tuples > 0) { stats->stadistinct =3D (float4) clamp_row_est(-stats->stadistinct * baserel->tuples); } } Thanks On Mon, Apr 13, 2026 at 9:19=E2=80=AFAM Richard Guo wrote: > I complained in [1] that some TPC-DS queries suffer from very poor > cardinality estimates on CTE scan filters, to the point that simply > disabling nestloop makes some queries run hundreds of times faster. > Here's a simple reproduction: > > create table t (a int, b int, c int); > insert into t select i%2, i, i from generate_series(1,1000) i; > analyze t; > > explain analyze > with cte as (select a, b, avg(c) as avg from t group by a, b) > select * from cte t1, cte t2 > where t1.a =3D 1 and t2.a =3D 1 and t1.avg =3D t2.avg; > > Column 'a' has only 2 distinct values, so the filter a=3D1 on the > 1000-row CTE output should estimate ~500 rows (assuming these values > are equally common). Instead, the CTE scan estimates 5 rows (1000 * > 1/200) because examine_simple_variable returns early when the subquery > has GROUP BY, and selectivity estimation falls back on > 1/DEFAULT_NUM_DISTINCT. > > -> CTE Scan on cte t1 (cost=3D0.00..22.50 rows=3D5 width=3D40) > (actual time=3D4.874..5.053 rows=3D500.00 loops=3D1) > > As a result, this query ends up with a Nested Loop plan, and the > Execution Time is 192.907 ms. > > For DISTINCT or GROUP BY key columns that are simple Vars, I think we > can propagate stadistinct from the base table, because the set of > distinct values is preserved after grouping. MCV frequencies, > histograms, and correlation data are not valid since GROUP BY and > DISTINCT change the frequency distribution, but with stadistinct > alone, callers like var_eq_const() can use a 1/ndistinct estimate > rather than 1/DEFAULT_NUM_DISTINCT. > > Attached is a patch to do this. With the patch, the example above > estimates 500 rows ... > > -> CTE Scan on cte t1 (cost=3D0.00..22.50 rows=3D500 width=3D40) > (actual time=3D3.785..4.143 rows=3D500.00 loops=3D1) > > ... and chooses a Hash Join, with an Execution Time of 8.238 ms (~20x > faster). > > I tested this patch on TPC-DS query 31: > > -- on master: > Planning Time: 5.207 ms > Execution Time: 1536140.258 ms > > -- on patched: > Planning Time: 5.140 ms > Execution Time: 1149.482 ms > > Over 1300x faster. > > Does this approach make sense? Any thoughts? > > [1] > https://postgr.es/m/CAMbWs4-QU_nkFqFZLdzWRsEsVE8aLWx4qBBVq7g4rXw+cvYDMg@m= ail.gmail.com > > - Richard > --00000000000051c7e1064f4f11ca Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
HI Richard

> + /* Con= vert negative stadistinct to absolute count */
> + if (stats->stad= istinct < 0)
> + {
> + RelOptInfo *baserel =3D find_base_re= l(subroot, var->varno);
> +
> + if (baserel->tuples >= 0)
> + {
> + stats->stadistinct =3D (float4)
> + = clamp_row_est(-stats->stadistinct * baserel->tuples);
> + }<= br>> + }

Thanks so much for working= on this! While looking at the negative stadistinct conversion, I was wonde= ring if we might run into a potential edge case with multi-level nested sub= queries. What do you think?

/* Convert= negative stadistinct to absolute count */

=C2=A0 =C2=A0 if (stats->stadistinct < 0)
=C2=A0 =C2=A0 {- =C2=A0 =C2=A0 =C2=A0 RelOptInfo *baserel =3D find_base_rel(subroot, var-= >varno);
+ =C2=A0 =C2=A0 =C2=A0 RelOptInfo *baserel =3D vardata->r= el;

- =C2=A0 =C2=A0 =C2=A0 if (baserel->tuples > 0)
+ =C2= =A0 =C2=A0 =C2=A0 if (baserel && baserel->tuples > 0)
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 sta= ts->stadistinct =3D (float4)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 clamp_row_est(-stats->stadistinct * baserel->tuples= );
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }
=C2=A0 =C2=A0 }

Thanks=C2=A0


On Mon, Apr 13, 2026= at 9:19=E2=80=AFAM Richard Guo <guofenglinux@gmail.com> wrote:
I complained in [1] that some TPC-DS queries suffe= r from very poor
cardinality estimates on CTE scan filters, to the point that simply
disabling nestloop makes some queries run hundreds of times faster.
Here's a simple reproduction:

create table t (a int, b int, c int);
insert into t select i%2, i, i from generate_series(1,1000) i;
analyze t;

explain analyze
with cte as (select a, b, avg(c) as avg from t group by a, b)
select * from cte t1, cte t2
where t1.a =3D 1 and t2.a =3D 1 and t1.avg =3D t2.avg;

Column 'a' has only 2 distinct values, so the filter a=3D1 on the 1000-row CTE output should estimate ~500 rows (assuming these values
are equally common).=C2=A0 Instead, the CTE scan estimates 5 rows (1000 * 1/200) because examine_simple_variable returns early when the subquery
has GROUP BY, and selectivity estimation falls back on
1/DEFAULT_NUM_DISTINCT.

=C2=A0 =C2=A0->=C2=A0 CTE Scan on cte t1=C2=A0 (cost=3D0.00..22.50 rows= =3D5 width=3D40)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(actua= l time=3D4.874..5.053 rows=3D500.00 loops=3D1)

As a result, this query ends up with a Nested Loop plan, and the
Execution Time is 192.907 ms.

For DISTINCT or GROUP BY key columns that are simple Vars, I think we
can propagate stadistinct from the base table, because the set of
distinct values is preserved after grouping.=C2=A0 MCV frequencies,
histograms, and correlation data are not valid since GROUP BY and
DISTINCT change the frequency distribution, but with stadistinct
alone, callers like var_eq_const() can use a 1/ndistinct estimate
rather than 1/DEFAULT_NUM_DISTINCT.

Attached is a patch to do this.=C2=A0 With the patch, the example above
estimates 500 rows ...

=C2=A0 =C2=A0->=C2=A0 CTE Scan on cte t1=C2=A0 (cost=3D0.00..22.50 rows= =3D500 width=3D40)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(actua= l time=3D3.785..4.143 rows=3D500.00 loops=3D1)

... and chooses a Hash Join, with an Execution Time of 8.238 ms (~20x
faster).

I tested this patch on TPC-DS query 31:

-- on master:
=C2=A0Planning Time: 5.207 ms
=C2=A0Execution Time: 1536140.258 ms

-- on patched:
=C2=A0Planning Time: 5.140 ms
=C2=A0Execution Time: 1149.482 ms

Over 1300x faster.

Does this approach make sense?=C2=A0 Any thoughts?

[1] https://po= stgr.es/m/CAMbWs4-QU_nkFqFZLdzWRsEsVE8aLWx4qBBVq7g4rXw+cvYDMg@mail.gmail.co= m

- Richard
--00000000000051c7e1064f4f11ca--