Extended statistics improvement: multi-column MCV missing values

public inbox for [email protected]  
help / color / mirror / Atom feed

Extended statistics improvement: multi-column MCV missing values
3+ messages / 2 participants
[nested] [flat]

* Extended statistics improvement: multi-column MCV missing values
@ 2026-05-18 16:09 Enrique Sánchez <[email protected]>
  2026-05-18 22:13 ` Re: Extended statistics improvement: multi-column MCV missing values Ilia Evdokimov <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Enrique Sánchez @ 2026-05-18 16:09 UTC (permalink / raw)
  To: [email protected]

Hi,

Postgres only uses multi-column MCVs when the value we are looking for is
in the list. If not, it falls back into individual independent statistics
to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the combination's
frequency is strictly bounded by the frequency of the last (least common)
item in that MCV list.

Test case
=========
```
-- 1. Create a minimal table
DROP TABLE IF EXISTS planner_trap;
CREATE TABLE planner_trap (
    col_a integer,
    col_b integer,
    col_c integer,
    sort_col integer
);

-- 2. 1 becomes the most common value (MCV) for all three columns.
INSERT INTO planner_trap (col_a, col_b, col_c, sort_col)
SELECT
    (pow(random(), 5) * 100)::integer + 1,
    (pow(random(), 5) * 10)::integer + 1,
    (pow(random(), 5) * 50)::integer + 1,
    i
FROM generate_series(1, 100000) AS i;

-- 3. Create indexes
CREATE INDEX idx_planner_trap_sort ON planner_trap(sort_col);
CREATE INDEX idx_planner_trap_a_b_c ON planner_trap(col_a, col_b, col_c);

-- 4. Make the exact combination (1, 1, 1) yield zero rows. This ensures it
won't appear
-- in the MCV list, even though the value '1' remains the most common for
each individual column.
UPDATE planner_trap
SET col_a = 2
WHERE col_a = 1 AND col_b = 1 AND col_c = 1;

-- 5. Create MCV statistics
CREATE STATISTICS (mcv) on col_a, col_b, col_c FROM planner_trap;
ANALYZE planner_trap;
```

The planner multiplies the individual selectivities and overestimates the
row count (5909; actual=0):
```
postgres=# EXPLAIN ANALYZE SELECT * FROM planner_trap WHERE col_a = 1 AND
col_b = 1 AND col_c = 1 ORDER BY sort_col DESC LIMIT 10;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..7.72 rows=10 width=16) (actual time=14.632..14.633
rows=0.00 loops=1)
   Buffers: shared hit=16355 dirtied=1
   ->  Index Scan Backward using idx_planner_trap_sort on planner_trap
 (cost=0.29..4386.99 rows=5909 width=16) (actual time=14.630..14.630
rows=0.00 loops=1)
         Filter: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
         Rows Removed by Filter: 100000
         Index Searches: 1
         Buffers: shared hit=16355 dirtied=1
 Planning:
   Buffers: shared hit=33 read=1
 Planning Time: 0.478 ms
 Execution Time: 14.651 ms
```

1. Cap selectivity with the last MCV item's frequency
=====================================================
Applying the last MCV cap here, Postgres estimates 117 rows: 0.001167 *
100000 = 117
```
postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on
(oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat' and index = 99;
 index |  values  |  nulls  |       frequency       | base_frequency
-------+----------+---------+-----------------------+----------------
    99 | {1,1,31} | {f,f,f} | 0.0011666666666666668 | 0.001049409536
```

making postgres to choose a better plan:
```
 Limit  (cost=300.72..300.75 rows=10 width=16) (actual time=0.045..0.046
rows=0.00 loops=1)
   Buffers: shared hit=3 read=2
   ->  Sort  (cost=300.72..301.02 rows=117 width=16) (actual
time=0.043..0.044 rows=0.00 loops=1)
         Sort Key: sort_col DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3 read=2
         ->  Bitmap Heap Scan on planner_trap  (cost=5.78..298.19 rows=117
width=16) (actual time=0.026..0.027 rows=0.00 loops=1)
               Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
               Buffers: shared read=2
               ->  Bitmap Index Scan on idx_planner_trap_a_b_c
 (cost=0.00..5.76 rows=117 width=0) (actual time=0.019..0.020 rows=0.00
loops=1)
                     Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
                     Index Searches: 1
                     Buffers: shared read=2
 Planning:
   Buffers: shared hit=51 read=8 dirtied=4
 Planning Time: 0.742 ms
 Execution Time: 0.094 ms
```

2. Estimate selectivity as Postgres does for single-column values not in
MCVs
=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:
> The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the MCVs:
> That is, add up all the frequencies for the MCVs and subtract them from
one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the
MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)

```
postgres=# SELECT sum(m.frequency) FROM pg_statistic_ext join
pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat';
         sum
---------------------
 0.39456666666666645

postgres=# SELECT stxkeys AS k, stxdndistinct AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2';
   k   |                      nd

-------+---------------------------------------------------
 1 2 3 | [...{"attributes": [1, 2, 3], "ndistinct": 8511}]
```

Then (1 - 0.39456666666666645) / (8511 - 100) = 0.000071981; 0.000071981 *
100000 = 7 rows
```
 Limit  (cost=30.30..30.32 rows=7 width=16) (actual time=0.035..0.036
rows=0.00 loops=1)
   Buffers: shared hit=2
   ->  Sort  (cost=30.30..30.32 rows=7 width=16) (actual time=0.033..0.034
rows=0.00 loops=1)
         Sort Key: sort_col DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=2
         ->  Bitmap Heap Scan on planner_trap  (cost=4.38..30.20 rows=7
width=16) (actual time=0.028..0.029 rows=0.00 loops=1)
               Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
               Buffers: shared hit=2
               ->  Bitmap Index Scan on idx_planner_trap_a_b_c
 (cost=0.00..4.38 rows=7 width=0) (actual time=0.022..0.022 rows=0.00
loops=1)
                     Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
                     Index Searches: 1
                     Buffers: shared hit=2
 Planning Time: 0.192 ms
 Execution Time: 0.057 ms
```

I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV list
itself, and the O(1) arithmetic during planning adds no measurable
overhead. I look forward to your feedback before drafting a patch.

Best,
Enrique.


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Extended statistics improvement: multi-column MCV missing values
  2026-05-18 16:09 Extended statistics improvement: multi-column MCV missing values Enrique Sánchez <[email protected]>
@ 2026-05-18 22:13 ` Ilia Evdokimov <[email protected]>
  2026-05-24 18:34   ` Re: Extended statistics improvement: multi-column MCV missing values Enrique Sánchez <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Ilia Evdokimov @ 2026-05-18 22:13 UTC (permalink / raw)
  To: Enrique Sánchez <[email protected]>; [email protected]

Hi Enrique,

On 5/18/26 19:09, Enrique Sánchez wrote:

>
> Postgres only uses multi-column MCVs when the value we are looking for 
> is in the list. If not, it falls back into individual independent 
> statistics to estimate selectivity.
> However, a miss in a multi-column MCV list still yields valuable 
> information that it currently throws away: we know that the 
> combination's frequency is strictly bounded by the frequency of the 
> last (least common) item in that MCV list.

LGTM. If the multicolumn MCV statistics exists and the clause 
combination is absent from the MCV-list, we can use the least frequent 
MCV item as an upper bound. BTW, this only applies to AND-clauses.


>
> 2. Estimate selectivity as Postgres does for single-column values not 
> in MCVs
> =============================================================================
> While that significantly improves estimations, we could mirror what 
> Postgres already does for individual MCVs. Quote from the official 
> documentation:
> > The approach is to use the fact that the value is not in the list, 
> combined with the knowledge of the frequencies for all of the MCVs:
> > That is, add up all the frequencies for the MCVs and subtract them 
> from one, then divide by the number of other distinct values.
>
> To achieve this, we need to store an ndistinct estimation alongside 
> the MCVs that can be used for partial or entire column match.
>
> P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) - 
> MCV_list_size)
>
>
> ...
>
> I think this is a cheap way to prevent bad estimations. The storage 
> overhead of adding an ndistinct field is trivial compared to the MCV 
> list itself, and the O(1) arithmetic during planning adds no 
> measurable overhead. I look forward to your feedback before drafting a 
> patch.

For this, the ndistinct extended statistics already exist. If both MCV 
and ndistinct statistics are present on the same column set, the formula 
is correct. There are already places in the code that compute ndistinct 
for columns without extended ndistinct statistics (see 
estimate_num_groups) - but it is worth thinking carefully about whether 
the added complexity is justified before going down that path.

-- 
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Extended statistics improvement: multi-column MCV missing values
  2026-05-18 16:09 Extended statistics improvement: multi-column MCV missing values Enrique Sánchez <[email protected]>
  2026-05-18 22:13 ` Re: Extended statistics improvement: multi-column MCV missing values Ilia Evdokimov <[email protected]>
@ 2026-05-24 18:34   ` Enrique Sánchez <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Enrique Sánchez @ 2026-05-24 18:34 UTC (permalink / raw)
  To: Ilia Evdokimov <[email protected]>; +Cc: [email protected]

Hi Ilia,

> 1. Cap selectivity with the last MCV item's frequency

I've attached a draft patch. It's split into 4 commits to make it easier to
review. It adds the MCV cap for AND (equality, IN, ANY) clauses. Looking
forward to your feedback.

On 5/19/26 0:13, Ilia Evdokimov wrote:

> Postgres only uses multi-column MCVs when the value we are looking for is
> in the list. If not, it falls back into individual independent statistics
> to estimate selectivity.
> However, a miss in a multi-column MCV list still yields valuable
> information that it currently throws away: we know that the combination's
> frequency is strictly bounded by the frequency of the last (least common)
> item in that MCV list.
>
> LGTM. If the multicolumn MCV statistics exists and the clause combination
> is absent from the MCV-list, we can use the least frequent MCV item as an
> upper bound. BTW, this only applies to AND-clauses.
>
Given that the inclusion-exclusion principle (P(A OR B) = P(A) + P(B) - P(A
AND B)) is used to calculate OR paths, we could use the same cap to improve
the overlap estimation (P(A AND B)).

2. Estimate selectivity as Postgres does for single-column values not in
> MCVs
>
> =============================================================================
> While that significantly improves estimations, we could mirror what
> Postgres already does for individual MCVs. Quote from the official
> documentation:
> > The approach is to use the fact that the value is not in the list,
> combined with the knowledge of the frequencies for all of the MCVs:
> > That is, add up all the frequencies for the MCVs and subtract them from
> one, then divide by the number of other distinct values.
>
> To achieve this, we need to store an ndistinct estimation alongside the
> MCVs that can be used for partial or entire column match.
>
> P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
> MCV_list_size)
> ...
> I think this is a cheap way to prevent bad estimations. The storage
> overhead of adding an ndistinct field is trivial compared to the MCV list
> itself, and the O(1) arithmetic during planning adds no measurable
> overhead. I look forward to your feedback before drafting a patch.
>
> For this, the ndistinct extended statistics already exist. If both MCV and
> ndistinct statistics are present on the same column set, the formula is
> correct. There are already places in the code that compute ndistinct for
> columns without extended ndistinct statistics (see estimate_num_groups) -
> but it is worth thinking carefully about whether the added complexity is
> justified before going down that path.
>
I think that the implementation would look similar to the attached patch.
The only added complexity is getting the ndistinct estimation. There are 2
ways:
- Rely on the extended ndistinct statistic if it exists
- Calculate the ndistinct(col_a, col_b, col_c) statistic as part of the
MCV. The storage cost is negligible compared to the MCV list and it keeps
the MCV statistic complete, so it doesn't need to check if the ndistinct
statistic exists. That would be a change on the MCV struct, meaning a
change in the mcvlist->type.

I can start working on a proposal for this second part. Let me know what
you think.

Best regards,
Enrique.


Attachments:

  [text/x-patch] v1-0004-Extend-multi-column-MCV-cap-to-AND-clauses-inside-OR.patch (8.6K, 3-v1-0004-Extend-multi-column-MCV-cap-to-AND-clauses-inside-OR.patch)
  download | inline diff:
From 145c56ff65785eb559e720f3c04b908a57c61940 Mon Sep 17 00:00:00 2001
From: Enrique Sanchez Cardoso <[email protected]>
Date: Sun, 24 May 2026 19:09:26 +0200
Subject: [PATCH 4/4] Extend multi-column MCV cap to AND clauses inside OR
 expressions

---
 src/backend/statistics/extended_stats.c       | 12 +++-
 src/backend/statistics/mcv.c                  | 68 +++++++++++++++----
 .../statistics/extended_stats_internal.h      |  3 +-
 src/test/regress/expected/stats_ext.out       | 16 ++++-
 src/test/regress/sql/stats_ext.sql            |  8 ++-
 5 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index da6f6315698..fc6f7905bb3 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1915,7 +1915,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 							overlap_basesel,
 							mcv_totalsel,
 							clause_sel,
-							overlap_sel;
+							overlap_sel,
+							clause_cap;
 
 				/*
 				 * "Simple" selectivity of the next clause and its overlap
@@ -1945,7 +1946,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 													&mcv_basesel,
 													&overlap_mcvsel,
 													&overlap_basesel,
-													&mcv_totalsel);
+													&mcv_totalsel,
+													&clause_cap);
 
 				/*
 				 * Combine the simple and multi-column estimates.
@@ -1959,11 +1961,17 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 				if (bms_is_member(listidx, simple_clauses))
 					clause_sel = simple_sel;
 				else
+				{
 					clause_sel = mcv_combine_selectivities(simple_sel,
 														   mcv_sel,
 														   mcv_basesel,
 														   mcv_totalsel);
 
+					/* Cap the contribution of values not found in the MCV. */
+					if (clause_sel > clause_cap)
+						clause_sel = clause_cap;
+				}
+
 				overlap_sel = mcv_combine_selectivities(overlap_simple_sel,
 														overlap_mcvsel,
 														overlap_basesel,
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 62761c58e33..df75087391c 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1577,6 +1577,40 @@ mcv_cap_multiplier(List *clauses)
 	return multiplier;
 }
 
+/*
+ * mcv_compute_cap
+ *		Compute a selectivity cap based on the least common MCV frequency.
+ *
+ * When one equality/IN clause covers each MCV dimension, value combinations
+ * not found in the MCV can't be more frequent than the least common tracked
+ * combination.  The cap is: matched MCV frequency plus the number of
+ * non-MCV combinations times the least common MCV frequency.
+ *
+ * Returns 1.0 (no cap) when the clauses don't fully cover all dimensions
+ * or contain unsupported clause types.
+ */
+static Selectivity
+mcv_compute_cap(MCVList *mcv, List *clauses, Selectivity mcv_sel,
+					 int64 matched_count)
+{
+	int64		cap_mult;
+	int64		non_mcv_mult;
+	Selectivity cap;
+
+	if (list_length(clauses) != mcv->ndimensions)
+		return 1.0;
+
+	cap_mult = mcv_cap_multiplier(clauses);
+	non_mcv_mult = cap_mult - matched_count;
+
+	if (non_mcv_mult <= 0)
+		return 1.0;
+
+	cap = mcv_sel + non_mcv_mult * mcv->items[mcv->nitems - 1].frequency;
+	CLAMP_PROBABILITY(cap);
+	return cap;
+}
+
 /*
  * match the attribute/expression to a dimension of the statistic
  *
@@ -2144,17 +2178,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	 * combination is not among the most common, so it can't be more frequent
 	 * than the least common tracked combination.
 	 */
-	if (mcv->ndimensions == list_length(clauses))
-	{
-		int64		cap_mult = mcv_cap_multiplier(clauses);
-		int64		non_mcv_mult = cap_mult - matched_count;
-
-		if (non_mcv_mult > 0)
-		{
-			*cap = s + non_mcv_mult * mcv->items[mcv->nitems - 1].frequency;
-			CLAMP_PROBABILITY(*cap);
-		}
-	}
+	*cap = mcv_compute_cap(mcv, clauses, s, matched_count);
 
 	return s;
 }
@@ -2202,11 +2226,16 @@ Selectivity
 mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 						  MCVList *mcv, Node *clause, bool **or_matches,
 						  Selectivity *basesel, Selectivity *overlap_mcvsel,
-						  Selectivity *overlap_basesel, Selectivity *totalsel)
+						  Selectivity *overlap_basesel, Selectivity *totalsel,
+						  Selectivity *clause_cap)
 {
 	Selectivity s = 0.0;
 	bool	   *new_matches;
 	int			i;
+	int64		matched_count = 0;
+
+	/* default: no cap on clause selectivity */
+	*clause_cap = 1.0;
 
 	/* build the OR-matches bitmap, if not built already */
 	if (*or_matches == NULL)
@@ -2233,6 +2262,7 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 		{
 			s += mcv->items[i].frequency;
 			*basesel += mcv->items[i].base_frequency;
+			matched_count++;
 
 			if ((*or_matches)[i])
 			{
@@ -2247,6 +2277,20 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	pfree(new_matches);
 
+	/*
+	 * When there is one equality/IN clause per MCV dimension, cap the
+	 * contribution of value combinations not found in the MCV.  Each such
+	 * combination is not among the most common, so it can't be more frequent
+	 * than the least common tracked combination.
+	 */
+	if (is_andclause(clause))
+	{
+		BoolExpr   *bexpr = (BoolExpr *) clause;
+
+		*clause_cap = mcv_compute_cap(mcv, bexpr->args, s,
+												matched_count);
+	}
+
 	return s;
 }
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 01b5f67b843..10f41f87564 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -140,6 +140,7 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *basesel,
 											 Selectivity *overlap_mcvsel,
 											 Selectivity *overlap_basesel,
-											 Selectivity *totalsel);
+											 Selectivity *totalsel,
+											 Selectivity *clause_cap);
 
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 7ea244f7851..c712679b573 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2931,7 +2931,7 @@ DROP TABLE mcv_lists_partial;
 -- P(a=0)=0.5 and P(b=0)=0.5, so the independence estimate is 0.25 * N.
 -- After building MCV statistics the cap limits the combined estimate to the
 -- least-common MCV frequency, eliminating most of the over-estimation.
-CREATE TABLE mcv_cap (a INT, b INT) WITH (autovacuum_enabled = off);
+CREATE TABLE mcv_cap (a INT, b INT, c INT DEFAULT 0) WITH (autovacuum_enabled = off);
 INSERT INTO mcv_cap
     SELECT 0, b FROM generate_series(1, 99) b, generate_series(1, 100) r;
 INSERT INTO mcv_cap
@@ -2979,6 +2979,20 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (
        200 |      0
 (1 row)
 
+-- partial MCV match inside OR (a=0, b=99)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE c = 1 OR (a = 0 AND b IN (0, 99))');
+ estimated | actual 
+-----------+--------
+       200 |    100
+(1 row)
+
+-- no MCV match inside OR
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE c = 1 OR (a = 0 AND b = 0)');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
 DROP TABLE mcv_cap;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 8e0b8c0eb5c..926dfaa4e6d 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1468,7 +1468,7 @@ DROP TABLE mcv_lists_partial;
 -- P(a=0)=0.5 and P(b=0)=0.5, so the independence estimate is 0.25 * N.
 -- After building MCV statistics the cap limits the combined estimate to the
 -- least-common MCV frequency, eliminating most of the over-estimation.
-CREATE TABLE mcv_cap (a INT, b INT) WITH (autovacuum_enabled = off);
+CREATE TABLE mcv_cap (a INT, b INT, c INT DEFAULT 0) WITH (autovacuum_enabled = off);
 
 INSERT INTO mcv_cap
     SELECT 0, b FROM generate_series(1, 99) b, generate_series(1, 100) r;
@@ -1497,6 +1497,12 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (
 -- no MCV match
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
 
+-- partial MCV match inside OR (a=0, b=99)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE c = 1 OR (a = 0 AND b IN (0, 99))');
+
+-- no MCV match inside OR
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE c = 1 OR (a = 0 AND b = 0)');
+
 DROP TABLE mcv_cap;
 
 -- check the ability to use multiple MCV lists
-- 
2.43.0



  [text/x-patch] v1-0001-Cap-selectivity-when-values-are-not-in-multi-column-.patch (9.1K, 4-v1-0001-Cap-selectivity-when-values-are-not-in-multi-column-.patch)
  download | inline diff:
From 42a36e6d98d7642e87500de8f138e1f54160fe55 Mon Sep 17 00:00:00 2001
From: Enrique Sanchez Cardoso <[email protected]>
Date: Sun, 24 May 2026 01:03:14 +0200
Subject: [PATCH 1/4] Cap selectivity when values are not in multi-column mcv

Selectivity can't be > last MCV item (least common) selectivity when
they are AND clauses and cover all the MCV dimensions.
---
 src/backend/statistics/extended_stats.c       | 11 +++-
 src/backend/statistics/mcv.c                  | 43 ++++++++++++++-
 .../statistics/extended_stats_internal.h      |  3 +-
 src/test/regress/expected/stats_ext.out       | 52 +++++++++++++++++++
 src/test/regress/sql/stats_ext.sql            | 34 ++++++++++++
 5 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 2b83355d26e..f8c38653bf9 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1989,6 +1989,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 						mcv_sel,
 						mcv_basesel,
 						mcv_totalsel,
+						mcv_cap,
 						stat_sel;
 
 			/*
@@ -2006,7 +2007,8 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 			mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses,
 												 varRelid, jointype, sjinfo,
 												 rel, &mcv_basesel,
-												 &mcv_totalsel);
+												 &mcv_totalsel,
+												 &mcv_cap);
 
 			/* Combine the simple and multi-column estimates. */
 			stat_sel = mcv_combine_selectivities(simple_sel,
@@ -2014,6 +2016,13 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 												 mcv_basesel,
 												 mcv_totalsel);
 
+			/*
+			 * Cap to the least common MCV frequency when no MCV items
+			 * matched.
+			 */
+			if (stat_sel > mcv_cap)
+				stat_sel = mcv_cap;
+
 			/* Factor this into the overall result */
 			sel *= stat_sel;
 		}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 0b7da605a4c..df70d00cc3d 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "statistics/statistics.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
@@ -1523,6 +1524,32 @@ pg_mcv_list_send(PG_FUNCTION_ARGS)
 	return byteasend(fcinfo);
 }
 
+/*
+ * mcv_is_all_equality_clauses
+ *		Check if all clauses are simple equality conditions (OpExpr with eqsel
+ *		restriction estimator).  This mirrors the check done by
+ *		dependency_is_compatible_clause() in dependencies.c.
+ */
+static bool
+mcv_is_all_equality_clauses(List *clauses)
+{
+	ListCell   *lc;
+
+	foreach(lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		if (!is_opclause(clause) ||
+			get_oprrest(((OpExpr *) clause)->opno) != F_EQSEL)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * match the attribute/expression to a dimension of the statistic
  *
@@ -2047,7 +2074,8 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   List *clauses, int varRelid,
 						   JoinType jointype, SpecialJoinInfo *sjinfo,
 						   RelOptInfo *rel,
-						   Selectivity *basesel, Selectivity *totalsel)
+						   Selectivity *basesel, Selectivity *totalsel,
+						   Selectivity *cap)
 {
 	int			i;
 	MCVList    *mcv;
@@ -2057,6 +2085,9 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	/* match/mismatch bitmap for each MCV item */
 	bool	   *matches = NULL;
 
+	/* default: no cap on combined selectivity */
+	*cap = 1.0;
+
 	/* load the MCV list stored in the statistics object */
 	mcv = statext_mcv_load(stat->statOid, rte->inh);
 
@@ -2078,6 +2109,16 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 		}
 	}
 
+	/*
+	 * When no MCV item matched and there is one equality clause per MCV
+	 * dimension, cap the selectivity to the least common MCV frequency. The
+	 * combination is not among the most common, so it can't be more frequent
+	 * than the least common tracked combination.
+	 */
+	if (s == 0.0 && mcv->ndimensions == list_length(clauses) &&
+		mcv_is_all_equality_clauses(clauses))
+		*cap = mcv->items[mcv->nitems - 1].frequency;
+
 	return s;
 }
 
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index c775442f2ee..01b5f67b843 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -129,7 +129,8 @@ extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  SpecialJoinInfo *sjinfo,
 											  RelOptInfo *rel,
 											  Selectivity *basesel,
-											  Selectivity *totalsel);
+											  Selectivity *totalsel,
+											  Selectivity *cap);
 
 extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 StatisticExtInfo *stat,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 37070c1a896..1ca26669bb1 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2928,6 +2928,58 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0
 (1 row)
 
 DROP TABLE mcv_lists_partial;
+-- P(a=0)=0.5 and P(b=0)=0.5, so the independence estimate is 0.25 * N.
+-- After building MCV statistics the cap limits the combined estimate to the
+-- least-common MCV frequency, eliminating most of the over-estimation.
+CREATE TABLE mcv_cap (a INT, b INT) WITH (autovacuum_enabled = off);
+INSERT INTO mcv_cap
+    SELECT 0, b FROM generate_series(1, 99) b, generate_series(1, 100) r;
+INSERT INTO mcv_cap
+    SELECT a, 0 FROM generate_series(1, 99) a, generate_series(1, 100) r;
+ANALYZE mcv_cap;
+-- without MCV statistics: independence gives 0.5 * 0.5 * 19800 = 4950 rows
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0');
+ estimated | actual 
+-----------+--------
+      4950 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
+ estimated | actual 
+-----------+--------
+      5000 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
+ estimated | actual 
+-----------+--------
+      4950 |      0
+(1 row)
+
+CREATE STATISTICS mcv_cap_stats (mcv) ON a, b FROM mcv_cap;
+ANALYZE mcv_cap;
+-- with MCV statistics: bounded by least MCV frequency
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- IN/ANY equality clauses are not supported, partial MCV match (a=0, b=99)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
+ estimated | actual 
+-----------+--------
+      5050 |    100
+(1 row)
+
+-- IN/ANY equality clauses are not supported, no MCV match
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
+ estimated | actual 
+-----------+--------
+      4950 |      0
+(1 row)
+
+DROP TABLE mcv_cap;
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 3cc6012b822..0f67363cd6d 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1465,6 +1465,40 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0
 
 DROP TABLE mcv_lists_partial;
 
+-- P(a=0)=0.5 and P(b=0)=0.5, so the independence estimate is 0.25 * N.
+-- After building MCV statistics the cap limits the combined estimate to the
+-- least-common MCV frequency, eliminating most of the over-estimation.
+CREATE TABLE mcv_cap (a INT, b INT) WITH (autovacuum_enabled = off);
+
+INSERT INTO mcv_cap
+    SELECT 0, b FROM generate_series(1, 99) b, generate_series(1, 100) r;
+
+INSERT INTO mcv_cap
+    SELECT a, 0 FROM generate_series(1, 99) a, generate_series(1, 100) r;
+
+ANALYZE mcv_cap;
+
+-- without MCV statistics: independence gives 0.5 * 0.5 * 19800 = 4950 rows
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
+
+CREATE STATISTICS mcv_cap_stats (mcv) ON a, b FROM mcv_cap;
+ANALYZE mcv_cap;
+
+-- with MCV statistics: bounded by least MCV frequency
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0');
+
+-- IN/ANY equality clauses are not supported, partial MCV match (a=0, b=99)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
+
+-- IN/ANY equality clauses are not supported, no MCV match
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
+
+DROP TABLE mcv_cap;
+
 -- check the ability to use multiple MCV lists
 CREATE TABLE mcv_lists_multi (
 	a INTEGER,
-- 
2.43.0



  [text/x-patch] v1-0002-Add-support-for-IN-ANY-clauses-in-multi-column-MCV-c.patch (4.9K, 5-v1-0002-Add-support-for-IN-ANY-clauses-in-multi-column-MCV-c.patch)
  download | inline diff:
From 3bdd48382fdcf0bd8eaaa6a328cde8a47eeee1ac Mon Sep 17 00:00:00 2001
From: Enrique Sanchez Cardoso <[email protected]>
Date: Sun, 24 May 2026 13:32:24 +0200
Subject: [PATCH 2/4] Add support for IN/ANY clauses in multi-column MCV cap

Extend the MCV-based cap to handle IN and ANY clauses, allowing the
selectivity cap to apply when these clauses are used alongside equality
filters.
---
 src/backend/statistics/mcv.c            | 60 +++++++++++++++++++------
 src/test/regress/expected/stats_ext.out |  6 +--
 src/test/regress/sql/stats_ext.sql      |  4 +-
 3 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index df70d00cc3d..2e75f19d8bd 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1525,14 +1525,19 @@ pg_mcv_list_send(PG_FUNCTION_ARGS)
 }
 
 /*
- * mcv_is_all_equality_clauses
- *		Check if all clauses are simple equality conditions (OpExpr with eqsel
- *		restriction estimator).  This mirrors the check done by
- *		dependency_is_compatible_clause() in dependencies.c.
+ * mcv_cap_multiplier
+ *		Compute a multiplier for capping combined selectivity to the least
+ *		common MCV frequency when no MCV items matched.
+ *
+ * Returns 0 if the cap should not be applied (unsupported clause types).
+ * Returns >= 1 as the number of distinct value combinations the clauses
+ * could match: 1 for each equality clause, N for each IN/ANY clause with
+ * N elements.
  */
-static bool
-mcv_is_all_equality_clauses(List *clauses)
+static int64
+mcv_cap_multiplier(List *clauses)
 {
+	int64		multiplier = 1;
 	ListCell   *lc;
 
 	foreach(lc, clauses)
@@ -1542,12 +1547,34 @@ mcv_is_all_equality_clauses(List *clauses)
 		if (IsA(clause, RestrictInfo))
 			clause = (Node *) ((RestrictInfo *) clause)->clause;
 
-		if (!is_opclause(clause) ||
-			get_oprrest(((OpExpr *) clause)->opno) != F_EQSEL)
-			return false;
+		if (is_opclause(clause))
+		{
+			/* Simple equality: factor 1 */
+			if (get_oprrest(((OpExpr *) clause)->opno) != F_EQSEL)
+				return 0;
+		}
+		else if (IsA(clause, ScalarArrayOpExpr))
+		{
+			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
+			Node	   *arg;
+			ArrayType  *arr;
+
+			/* Only ANY/IN with equality operator */
+			if (!saop->useOr || get_oprrest(saop->opno) != F_EQSEL)
+				return 0;
+
+			arg = (Node *) lsecond(saop->args);
+			if (!IsA(arg, Const) || ((Const *) arg)->constisnull)
+				return 0;
+
+			arr = DatumGetArrayTypeP(((Const *) arg)->constvalue);
+			multiplier *= ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+		}
+		else
+			return 0;			/* unsupported clause type */
 	}
 
-	return true;
+	return multiplier;
 }
 
 /*
@@ -2115,9 +2142,16 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 	 * combination is not among the most common, so it can't be more frequent
 	 * than the least common tracked combination.
 	 */
-	if (s == 0.0 && mcv->ndimensions == list_length(clauses) &&
-		mcv_is_all_equality_clauses(clauses))
-		*cap = mcv->items[mcv->nitems - 1].frequency;
+	if (s == 0.0 && mcv->ndimensions == list_length(clauses))
+	{
+		int64		cap_mult = mcv_cap_multiplier(clauses);
+
+		if (cap_mult > 0)
+		{
+			*cap = cap_mult * mcv->items[mcv->nitems - 1].frequency;
+			CLAMP_PROBABILITY(*cap);
+		}
+	}
 
 	return s;
 }
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 1ca26669bb1..82ad9edb568 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2965,18 +2965,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0'
        100 |      0
 (1 row)
 
--- IN/ANY equality clauses are not supported, partial MCV match (a=0, b=99)
+-- partial MCV match (a=0, b=99)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
  estimated | actual 
 -----------+--------
       5050 |    100
 (1 row)
 
--- IN/ANY equality clauses are not supported, no MCV match
+-- no MCV match
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
  estimated | actual 
 -----------+--------
-      4950 |      0
+       200 |      0
 (1 row)
 
 DROP TABLE mcv_cap;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0f67363cd6d..8e0b8c0eb5c 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1491,10 +1491,10 @@ ANALYZE mcv_cap;
 -- with MCV statistics: bounded by least MCV frequency
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0');
 
--- IN/ANY equality clauses are not supported, partial MCV match (a=0, b=99)
+-- partial MCV match (a=0, b=99)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
 
--- IN/ANY equality clauses are not supported, no MCV match
+-- no MCV match
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 100)');
 
 DROP TABLE mcv_cap;
-- 
2.43.0



  [text/x-patch] v1-0003-Extend-multicolumn-MCV-cap-to-partial-IN-ANY-matches.patch (3.5K, 6-v1-0003-Extend-multicolumn-MCV-cap-to-partial-IN-ANY-matches.patch)
  download | inline diff:
From fc369280662c9183ccc50ac4ded144acadd84c15 Mon Sep 17 00:00:00 2001
From: Enrique Sanchez Cardoso <[email protected]>
Date: Sun, 24 May 2026 14:40:39 +0200
Subject: [PATCH 3/4] Extend multicolumn MCV cap to partial IN/ANY matches

---
 src/backend/statistics/extended_stats.c |  5 +----
 src/backend/statistics/mcv.c            | 17 ++++++++++-------
 src/test/regress/expected/stats_ext.out |  2 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index f8c38653bf9..da6f6315698 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2016,10 +2016,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 												 mcv_basesel,
 												 mcv_totalsel);
 
-			/*
-			 * Cap to the least common MCV frequency when no MCV items
-			 * matched.
-			 */
+			/* Cap the contribution of values not found in the MCV. */
 			if (stat_sel > mcv_cap)
 				stat_sel = mcv_cap;
 
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 2e75f19d8bd..62761c58e33 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1526,8 +1526,8 @@ pg_mcv_list_send(PG_FUNCTION_ARGS)
 
 /*
  * mcv_cap_multiplier
- *		Compute a multiplier for capping combined selectivity to the least
- *		common MCV frequency when no MCV items matched.
+ * 		Compute a multiplier for capping combined selectivity to the least
+ * 		common MCV frequency.
  *
  * Returns 0 if the cap should not be applied (unsupported clause types).
  * Returns >= 1 as the number of distinct value combinations the clauses
@@ -2105,6 +2105,7 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 						   Selectivity *cap)
 {
 	int			i;
+	int64		matched_count = 0;
 	MCVList    *mcv;
 	Selectivity s = 0.0;
 	RangeTblEntry *rte = root->simple_rte_array[rel->relid];
@@ -2133,22 +2134,24 @@ mcv_clauselist_selectivity(PlannerInfo *root, StatisticExtInfo *stat,
 		{
 			*basesel += mcv->items[i].base_frequency;
 			s += mcv->items[i].frequency;
+			matched_count++;
 		}
 	}
 
 	/*
-	 * When no MCV item matched and there is one equality clause per MCV
-	 * dimension, cap the selectivity to the least common MCV frequency. The
+	 * When there is one equality/IN clause per MCV dimension, cap the
+	 * contribution of value combinations not found in the MCV.  Each such
 	 * combination is not among the most common, so it can't be more frequent
 	 * than the least common tracked combination.
 	 */
-	if (s == 0.0 && mcv->ndimensions == list_length(clauses))
+	if (mcv->ndimensions == list_length(clauses))
 	{
 		int64		cap_mult = mcv_cap_multiplier(clauses);
+		int64		non_mcv_mult = cap_mult - matched_count;
 
-		if (cap_mult > 0)
+		if (non_mcv_mult > 0)
 		{
-			*cap = cap_mult * mcv->items[mcv->nitems - 1].frequency;
+			*cap = s + non_mcv_mult * mcv->items[mcv->nitems - 1].frequency;
 			CLAMP_PROBABILITY(*cap);
 		}
 	}
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 82ad9edb568..7ea244f7851 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2969,7 +2969,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b = 0'
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_cap WHERE a = 0 AND b IN (0, 99)');
  estimated | actual 
 -----------+--------
-      5050 |    100
+       200 |    100
 (1 row)
 
 -- no MCV match
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2026-05-24 18:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-18 16:09 Extended statistics improvement: multi-column MCV missing values Enrique Sánchez <[email protected]>
2026-05-18 22:13 ` Ilia Evdokimov <[email protected]>
2026-05-24 18:34   ` Enrique Sánchez <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox