Re: Implement missing join selectivity estimation for range types

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Haibo Yan <[email protected]>
To: jian he <[email protected]>
Cc: vignesh C <[email protected]>
Cc: Schoemans Maxime <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: Damir Belyalov <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: SAKR Mahmoud <[email protected]>
Cc: Diogo Repas <[email protected]>
Cc: LUO Zhicheng <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Andrey Lepikhov <[email protected]>
Subject: Re: Implement missing join selectivity estimation for range types
Date: Mon, 6 Apr 2026 16:32:02 -0700
Message-ID: <CABXr29HUT2m7YPpB4gOZHphYmwpwuEWFZ_p6oeyouhWw8Zc1Tw@mail.gmail.com> (raw)
In-Reply-To: <CACJufxFhES7untAPmxJDrB-y_pNgah_1HNbXjcOj6g=yxh3kdA@mail.gmail.com>
References: <CAB4o4asMq3k6HN9WfDsssQ5DDVfAziB4TpiFJ8RBJgZTVuwC7g@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<CAB4o4aud47V_iRyWtA8+ZAmdXDjCF165R73AeCjx2RL0nzQzHA@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<CAB4o4asvPN=NT7KvS9zVQjZbdsiRW5t8aQctEkW7mxc4hbBxVQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CALH1Lgurg1y1DTeFOXOkpP=+X7saVGCh8gSjDLoCBOcFFWhz-A@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CALDaNm29NMhpPvoM49qqfYzVcRJq0CPTd-xSUcTF7RZQO2TASQ@mail.gmail.com>
	<[email protected]>
	<CALDaNm3A4j7=rmY128Eb3v8zi=hHAnGizUcbSX5ekZKHy-wmMA@mail.gmail.com>
	<CACJufxFhES7untAPmxJDrB-y_pNgah_1HNbXjcOj6g=yxh3kdA@mail.gmail.com>

On Mon, Apr 6, 2026 at 3:12 PM jian he <[email protected]> wrote:

> I cannot figure out why it aborts.
>
> as Tom mentioned in upthread about the test cases.
> similar to src/test/regress/sql/stats_ext.sql check_estimated_rows
> function.
> we can test it by something:
>
> create or replace function check_estimated_rows(text) returns table (ok
> bool)
> language plpgsql as
> $$
> declare
>     ln text;
>     tmp text[];
>     first_row bool := true;
> begin
>     for ln in
>         execute format('explain analyze %s', $1)
>     loop
>         if first_row then
>             first_row := false;
>             tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
>             return query select 0.2 < tmp[1]::float8 / tmp[2]::float8
> and tmp[1]::float8 / tmp[2]::float8 < 5;
>         end if;
>     end loop;
> end;
> $$;
>
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 && ir2$$);
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 << ir2$$);
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 >> ir2$$);
>
> Do you need 3 tables to do the test? because we need to actually run
> the query then compare the estimated row
> and actually returned rows.
> If you really execute the query with 3 table joins, it will take a lot of
> time.
> So two tables join with where quql should be fine?
>
> /* Fast-forwards i and j to start of iteration */
> + for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0;
> i++);
> + for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0;
> j++);
> +
> + /* Do the estimation on overlapping regions */
> + while (i < nhist1 && j < nhist2)
> + {
> + double cur_sel1,
> + cur_sel2;
> + RangeBound cur_sync;
> +
> + if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
> + cur_sync = hist1[i++];
> + else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
> + cur_sync = hist2[j++];
> + else
> + {
> + /* If equal, skip one */
> + cur_sync = hist1[i];
> +
>
> this part range_cmp_bound_values "if else if" part computed twice, you
> can just do
> `
> int cmp;
> cmp = range_cmp_bound_values(typcache, &hist1[i], &hist2[j]);
> if cmp <0  then
> else if cmp > 0 then
> else then
> `
>
> also. I think you can put the following into  main while loop.
> + for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0;
> i++);
> + for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0;
> j++);
>
> split range and multirange into 2 patches might be a good idea.
> seems: same function (calc_hist_join_selectivity) with same function
> signature in src/backend/utils/adt/multirangetypes_selfuncs.c
> and src/backend/utils/adt/rangetypes_selfuncs.c,
> previously mail complaints not resolved.
>
>
>
> Hi,all
I'd like to revive the discussion on improving selectivity estimation for
range/range joins.
Attached is the v5 patch which teaches rangejoinsel to use range bound
histograms for estimating the <<, >>, and && operators. Currently, the
planner often falls back to a hardcoded default (like 0.005), which can
lead to poor join ordering in complex queries.
In this version, I have intentionally excluded &< and &>.
a &< b essentially maps to upper(a) <= upper(b).
a &> b essentially maps to lower(a) >= lower(b).
Since these operators include equality (<= / >=) rather than strict
inequality (< / >), their estimation is slightly more nuanced. I believe
focusing on the strict inequality and overlap operators first allows us to
deliver a clean, converged, and significantly beneficial improvement. We
can discuss the best approach for the remaining operators once this
foundation is in place.
Test Results
My local tests show that the planner now correctly identifies cases with
zero or full selectivity, which were previously misestimated.
----------------------------------------------------------------------
CREATE TABLE t1 (id int, r int4range);
CREATE TABLE t2 (id int, r int4range);
INSERT INTO t1 SELECT g, int4range(g * 2 - 1, g * 2) FROM
generate_series(1, 1000) g;
INSERT INTO t2 SELECT g, int4range(10000 + g * 200, 10000 + g * 200 + 100)
FROM generate_series(1, 1000) g;
ANALYZE t1, t2;
----------------------------------------------------------------------

Real Selectivity:
----------------------------------------------------------------------
SELECT
    avg((t1.r << t2.r)::int) AS p_ll, -- Expected: 1.0
    avg((t1.r >> t2.r)::int) AS p_rr, -- Expected: 0.0
    avg((t1.r && t2.r)::int) AS p_ov  -- Expected: 0.0
FROM t1, t2;
-- Result: p_ll = 1.0, p_rr = 0.0, p_ov = 0.0
----------------------------------------------------------------------

Planner Improvements:
With the patch, the EXPLAIN output reflects these probabilities accurately:
For << (High Selectivity):
The planner correctly estimates rows=1000000 (1000 * 1000 * 1.0).
----------------------------------------------------------------------
Nested Loop (cost=29.50..15080.75 rows=1000000 width=61)
  Join Filter: (t1.r << t2.r)
----------------------------------------------------------------------
For && and >> (Near-Zero Selectivity):
The planner now correctly predicts rows=1 instead of using the default
multiplier.
----------------------------------------------------------------------
Nested Loop (cost=0.00..15036.50 rows=1 width=36)
  Join Filter: (t1.r && t2.r)
This improved estimation allows the optimizer to make much better decisions
regarding join order and nesting when range columns are involved.
I look forward to your feedback.
Regards,
Haibo


Attachments:

  [application/octet-stream] v5-0001-Improve-range-range-join-selectivity-estimation.patch (20.7K, 3-v5-0001-Improve-range-range-join-selectivity-estimation.patch)
  download | inline diff:
From ac1a0f23f7deb3a9a17bb8e150601ee9aaaafa46 Mon Sep 17 00:00:00 2001
From: Haibo Yan <[email protected]>
Date: Mon, 6 Apr 2026 09:30:10 -0700
Subject: [PATCH v5] Improve range/range join selectivity estimation

Teach rangejoinsel to estimate selected range/range join operators using
range histogram statistics instead of falling back to fixed defaults.

This improves planner row estimates for operators such as <<, >>, &&,
especially when the two range columns have clearly separated or strongly
overlapping distributions.

Regression tests cover plan changes for representative range join cases.
---
 src/backend/utils/adt/rangetypes_selfuncs.c | 301 ++++++++++++++++++++
 src/include/catalog/pg_operator.dat         |   6 +-
 src/include/catalog/pg_proc.dat             |   5 +
 src/test/regress/expected/rangetypes.out    | 114 ++++++++
 src/test/regress/sql/rangetypes.sql         |  53 ++++
 5 files changed, 476 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index 75f1e7567d5..97ae19fbcd2 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1221,3 +1221,304 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * Estimate join selectivity P(X < Y) using rangebound histograms.
+ *
+ * Based on: Diogo Repas, Zhicheng Luo, Maxime Schoemans, Mahmoud Sakr, 2022
+ * "Selectivity Estimation of Inequality Joins In Databases"
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * hist1 and hist2 are arrays of RangeBound entries from the bounds histograms
+ * of two range-typed attributes X and Y, respectively.  Each array has at
+ * least 2 entries (one histogram bin).  The entries carry full bound metadata
+ * (lower/upper flag, inclusive/exclusive), and all comparisons use
+ * range_cmp_bounds() so that bound semantics are preserved.
+ *
+ * The algorithm models each attribute's distribution as a piecewise function
+ * derived from its histogram, then computes:
+ *   P(X < Y) = 0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
+ * by parallel-scanning both histograms.
+ *
+ * The initial fast-forward loops skip histogram entries that fall entirely
+ * before the other histogram's range, so the main loop only processes the
+ * overlapping region.  Bounds checks are required because the histograms may
+ * be completely disjoint (e.g., all of X is below all of Y).
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+						   const RangeBound *hist1, int nhist1,
+						   const RangeBound *hist2, int nhist2)
+{
+	int			i,
+				j;
+	double		selectivity = 0.0;
+	double		prev_sel1 = -1.0;	/* negative sentinel skips first iter */
+	double		prev_sel2 = 0.0;
+
+	Assert(nhist1 > 1);
+	Assert(nhist2 > 1);
+
+	/*
+	 * Fast-forward past hist1 entries that are entirely below hist2[0], and
+	 * vice versa.  Bounds checks prevent out-of-bounds access when the
+	 * histograms are fully disjoint.
+	 */
+	for (i = 0; i < nhist1 &&
+		 range_cmp_bounds(typcache, &hist1[i], &hist2[0]) < 0; i++)
+		;
+	for (j = 0; j < nhist2 &&
+		 range_cmp_bounds(typcache, &hist2[j], &hist1[0]) < 0; j++)
+		;
+
+	/*
+	 * Handle fully-separated histograms.  When all bounds in hist1 are below
+	 * all bounds in hist2, P(X < Y) is ~1.0.  When all of hist2 is below
+	 * hist1, P(X < Y) is ~0.0.  We return immediately rather than falling
+	 * into the overlap walk with invalid indices.
+	 */
+	if (i >= nhist1)
+		return 1.0;
+	if (j >= nhist2)
+		return 0.0;
+
+	/* Walk the overlapping region of both histograms */
+	while (i < nhist1 && j < nhist2)
+	{
+		double		cur_sel1,
+					cur_sel2;
+		RangeBound	cur_sync;
+
+		if (range_cmp_bounds(typcache, &hist1[i], &hist2[j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bounds(typcache, &hist1[i], &hist2[j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* Equal bounds: advance both */
+			cur_sync = hist1[i];
+			i++;
+			j++;
+		}
+		cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+												hist1, nhist1, false);
+		cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+												hist2, nhist2, false);
+
+		/* Skip the first iteration (no previous point yet) */
+		if (prev_sel1 >= 0)
+			selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
+
+		prev_sel1 = cur_sel1;
+		prev_sel2 = cur_sel2;
+	}
+
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
+	/* Include remainder of hist2 if hist1 was exhausted first */
+	if (j < nhist2)
+		selectivity += 1 - prev_sel2;
+
+	return selectivity;
+}
+
+/*
+ * rangejoinsel -- join selectivity for range-vs-range operators
+ *
+ * Supports: <<, >>, &&
+ * These operators map directly to strict bound comparisons P(X < Y),
+ * which calc_hist_join_selectivity() estimates from bound histograms.
+ * Other range operators are left to their existing generic estimators.
+ */
+Datum
+rangejoinsel(PG_FUNCTION_ARGS)
+{
+	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+	Oid			operator = PG_GETARG_OID(1);
+	List	   *args = (List *) PG_GETARG_POINTER(2);
+	SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
+	VariableStatData vardata1;
+	VariableStatData vardata2;
+	Selectivity selec;
+	AttStatsSlot hist1;
+	AttStatsSlot hist2;
+	AttStatsSlot sslot;
+	bool		have_hist1 = false;
+	bool		have_hist2 = false;
+	TypeCacheEntry *typcache;
+	Form_pg_statistic stats1;
+	Form_pg_statistic stats2;
+	double		empty_frac1;
+	double		empty_frac2;
+	double		null_frac1;
+	double		null_frac2;
+	int			nhist1;
+	int			nhist2;
+	RangeBound *hist1_lower;
+	RangeBound *hist1_upper;
+	RangeBound *hist2_lower;
+	RangeBound *hist2_upper;
+	bool		empty;
+	int			i;
+
+	{
+		bool	join_is_reversed;
+
+		get_join_variables(root, args, sjinfo, &vardata1, &vardata2,
+						   &join_is_reversed);
+	}
+
+	selec = default_range_selectivity(operator);
+
+	/*
+	 * Acquire histogram stats for both sides.  Each slot is tracked
+	 * independently so we can release exactly what was acquired on any
+	 * failure path.
+	 */
+	if (!HeapTupleIsValid(vardata1.statsTuple) ||
+		!HeapTupleIsValid(vardata2.statsTuple))
+		goto cleanup;
+
+	if (vardata1.vartype != vardata2.vartype)
+		goto cleanup;
+
+	memset(&hist1, 0, sizeof(hist1));
+	memset(&hist2, 0, sizeof(hist2));
+
+	if (!get_attstatsslot(&hist1, vardata1.statsTuple,
+						  STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+						  ATTSTATSSLOT_VALUES))
+		goto cleanup;
+	have_hist1 = true;
+
+	if (!get_attstatsslot(&hist2, vardata2.statsTuple,
+						  STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+						  ATTSTATSSLOT_VALUES))
+		goto cleanup;
+	have_hist2 = true;
+
+	/* Initialize type cache */
+	typcache = range_get_typcache(fcinfo, vardata1.vartype);
+
+	/* Look up NULL and empty-range fractions */
+	stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
+	stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
+
+	null_frac1 = stats1->stanullfrac;
+	null_frac2 = stats2->stanullfrac;
+
+	/* Try to get fraction of empty ranges for the first variable */
+	if (get_attstatsslot(&sslot, vardata1.statsTuple,
+						 STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+						 InvalidOid, ATTSTATSSLOT_NUMBERS))
+	{
+		if (sslot.nnumbers != 1)
+			elog(ERROR, "invalid empty fraction statistic");
+		empty_frac1 = sslot.numbers[0];
+		free_attstatsslot(&sslot);
+	}
+	else
+	{
+		empty_frac1 = 0.0;
+	}
+
+	/* Try to get fraction of empty ranges for the second variable */
+	if (get_attstatsslot(&sslot, vardata2.statsTuple,
+						 STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+						 InvalidOid, ATTSTATSSLOT_NUMBERS))
+	{
+		if (sslot.nnumbers != 1)
+			elog(ERROR, "invalid empty fraction statistic");
+		empty_frac2 = sslot.numbers[0];
+		free_attstatsslot(&sslot);
+	}
+	else
+	{
+		empty_frac2 = 0.0;
+	}
+
+	/* Convert range histograms to separate lower/upper bound arrays */
+	nhist1 = hist1.nvalues;
+	hist1_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+	hist1_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+	for (i = 0; i < nhist1; i++)
+	{
+		range_deserialize(typcache, DatumGetRangeTypeP(hist1.values[i]),
+						  &hist1_lower[i], &hist1_upper[i], &empty);
+		if (empty)
+			elog(ERROR, "bounds histogram contains an empty range");
+	}
+
+	nhist2 = hist2.nvalues;
+	hist2_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+	hist2_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+	for (i = 0; i < nhist2; i++)
+	{
+		range_deserialize(typcache, DatumGetRangeTypeP(hist2.values[i]),
+						  &hist2_lower[i], &hist2_upper[i], &empty);
+		if (empty)
+			elog(ERROR, "bounds histogram contains an empty range");
+	}
+
+	/* Estimate selectivity based on the operator */
+	switch (operator)
+	{
+		case OID_RANGE_OVERLAP_OP:
+
+			/*
+			 * A && B iff NOT(A << B) AND NOT(A >> B)
+			 * = 1 - P(A.upper < B.lower) - P(B.upper < A.lower)
+			 */
+			selec = 1;
+			selec -= calc_hist_join_selectivity(typcache,
+												hist1_upper, nhist1,
+												hist2_lower, nhist2);
+			selec -= calc_hist_join_selectivity(typcache,
+												hist2_upper, nhist2,
+												hist1_lower, nhist1);
+			break;
+
+		case OID_RANGE_LEFT_OP:
+			/* A << B iff upper(A) < lower(B) */
+			selec = calc_hist_join_selectivity(typcache,
+											   hist1_upper, nhist1,
+											   hist2_lower, nhist2);
+			break;
+
+		case OID_RANGE_RIGHT_OP:
+			/* A >> B iff upper(B) < lower(A) */
+			selec = calc_hist_join_selectivity(typcache,
+											   hist2_upper, nhist2,
+											   hist1_lower, nhist1);
+			break;
+
+		default:
+			/* Unsupported operator; keep the default selectivity */
+			goto cleanup;
+	}
+
+	/* The histogram-based selectivity applies to non-empty ranges only */
+	selec *= (1 - empty_frac1) * (1 - empty_frac2);
+
+	/*
+	 * For the supported operators (<<, >>, &&), empty ranges always produce
+	 * false, so no empty-fraction adjustment is needed.
+	 */
+
+	/* All range operators are strict */
+	selec *= (1 - null_frac1) * (1 - null_frac2);
+
+cleanup:
+	if (have_hist2)
+		free_attstatsslot(&hist2);
+	if (have_hist1)
+		free_attstatsslot(&hist1);
+
+	ReleaseVariableStats(vardata1);
+	ReleaseVariableStats(vardata2);
+
+	CLAMP_PROBABILITY(selec);
+
+	PG_RETURN_FLOAT8((float8) selec);
+}
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index 1465f13120a..5ea4434f9fa 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -3094,7 +3094,7 @@
   oprname => '&&', oprleft => 'anyrange', oprright => 'anyrange',
   oprresult => 'bool', oprcom => '&&(anyrange,anyrange)',
   oprcode => 'range_overlaps', oprrest => 'rangesel',
-  oprjoin => 'areajoinsel' },
+  oprjoin => 'rangejoinsel' },
 { oid => '3889', oid_symbol => 'OID_RANGE_CONTAINS_ELEM_OP',
   descr => 'contains',
   oprname => '@>', oprleft => 'anyrange', oprright => 'anyelement',
@@ -3122,12 +3122,12 @@
   oprname => '<<', oprleft => 'anyrange', oprright => 'anyrange',
   oprresult => 'bool', oprcom => '>>(anyrange,anyrange)',
   oprcode => 'range_before', oprrest => 'rangesel',
-  oprjoin => 'scalarltjoinsel' },
+  oprjoin => 'rangejoinsel' },
 { oid => '3894', oid_symbol => 'OID_RANGE_RIGHT_OP', descr => 'is right of',
   oprname => '>>', oprleft => 'anyrange', oprright => 'anyrange',
   oprresult => 'bool', oprcom => '<<(anyrange,anyrange)',
   oprcode => 'range_after', oprrest => 'rangesel',
-  oprjoin => 'scalargtjoinsel' },
+  oprjoin => 'rangejoinsel' },
 { oid => '3895', oid_symbol => 'OID_RANGE_OVERLAPS_LEFT_OP',
   descr => 'overlaps or is left of',
   oprname => '&<', oprleft => 'anyrange', oprright => 'anyrange',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ea17fc5629..c16aa8cec84 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12787,6 +12787,11 @@
   proname => 'error_on_null', proisstrict => 'f', prorettype => 'anyelement',
   proargtypes => 'anyelement', prosrc => 'pg_error_on_null' },
 
+{ oid => '8355', descr => 'join selectivity for range operators',
+  proname => 'rangejoinsel', provolatile => 's', prorettype => 'float8',
+  proargtypes => 'internal oid internal int2 internal',
+  prosrc => 'rangejoinsel' },
+
 { oid => '6321', descr => 'list of available WAL summary files',
   proname => 'pg_available_wal_summaries', prorows => '100', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index e062a4e5c2c..2fc5b770f90 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -2033,3 +2033,117 @@ select * from text_support_test where t <@ textrange_supp('a', 'd');
 
 drop table text_support_test;
 drop type textrange_supp;
+--
+-- test selectivity of range join operators
+--
+create table test_range_join_1 (ir1 int4range);
+create table test_range_join_2 (ir2 int4range);
+create table test_range_join_3 (ir3 int4range);
+insert into test_range_join_1 select int4range(g, g+10) from generate_series(1, 1000) g;
+insert into test_range_join_1 select int4range(g, g+100) from generate_series(1, 1000, 10) g;
+insert into test_range_join_2 select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_range_join_2 select int4range(g, g+100) from generate_series(1, 500, 10) g;
+insert into test_range_join_3 select int4range(g, g+10) from generate_series(501, 1000) g;
+insert into test_range_join_3 select int4range(g, g+100) from generate_series(501, 1000, 10) g;
+analyze test_range_join_1;
+analyze test_range_join_2;
+analyze test_range_join_3;
+-- reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 && ir2 and ir2 && ir3;
+                                    QUERY PLAN                                     
+-----------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (test_range_join_1.ir1 && test_range_join_2.ir2)
+         ->  Seq Scan on test_range_join_1
+         ->  Materialize
+               ->  Nested Loop
+                     Join Filter: (test_range_join_2.ir2 && test_range_join_3.ir3)
+                     ->  Seq Scan on test_range_join_2
+                     ->  Materialize
+                           ->  Seq Scan on test_range_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 << ir2 and ir2 << ir3;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (test_range_join_2.ir2 << test_range_join_3.ir3)
+         ->  Nested Loop
+               Join Filter: (test_range_join_1.ir1 << test_range_join_2.ir2)
+               ->  Seq Scan on test_range_join_1
+               ->  Materialize
+                     ->  Seq Scan on test_range_join_2
+         ->  Materialize
+               ->  Seq Scan on test_range_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 >> ir2 and ir2 >> ir3;
+                                 QUERY PLAN                                  
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (test_range_join_1.ir1 >> test_range_join_2.ir2)
+         ->  Nested Loop
+               Join Filter: (test_range_join_2.ir2 >> test_range_join_3.ir3)
+               ->  Seq Scan on test_range_join_2
+               ->  Materialize
+                     ->  Seq Scan on test_range_join_3
+         ->  Seq Scan on test_range_join_1
+(9 rows)
+
+drop table test_range_join_1;
+drop table test_range_join_2;
+drop table test_range_join_3;
+--
+-- test range join selectivity with fully disjoint histograms
+-- (exercises the bounds-check logic when histograms do not overlap)
+--
+create table test_range_join_lo (r int4range);
+create table test_range_join_hi (r int4range);
+-- low ranges: [1,11), [2,12), ... [500,510)
+insert into test_range_join_lo select int4range(g, g+10) from generate_series(1, 500) g;
+-- high ranges: [10001,10011), [10002,10012), ... [10500,10510)
+insert into test_range_join_hi select int4range(g, g+10) from generate_series(10001, 10500) g;
+analyze test_range_join_lo;
+analyze test_range_join_hi;
+-- lo << hi should produce a large selectivity (most pairs match)
+-- lo >> hi should produce a near-zero selectivity
+-- lo && hi should produce a near-zero selectivity (no overlap)
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r << b.r;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (a.r << b.r)
+         ->  Seq Scan on test_range_join_lo a
+         ->  Materialize
+               ->  Seq Scan on test_range_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r >> b.r;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (a.r >> b.r)
+         ->  Seq Scan on test_range_join_lo a
+         ->  Materialize
+               ->  Seq Scan on test_range_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r && b.r;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (a.r && b.r)
+         ->  Seq Scan on test_range_join_lo a
+         ->  Materialize
+               ->  Seq Scan on test_range_join_hi b
+(6 rows)
+
+drop table test_range_join_lo;
+drop table test_range_join_hi;
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index 5c4b0337b7a..f69109da334 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -708,3 +708,56 @@ select * from text_support_test where t <@ textrange_supp('a', 'd');
 drop table text_support_test;
 
 drop type textrange_supp;
+
+--
+-- test selectivity of range join operators
+--
+create table test_range_join_1 (ir1 int4range);
+create table test_range_join_2 (ir2 int4range);
+create table test_range_join_3 (ir3 int4range);
+
+insert into test_range_join_1 select int4range(g, g+10) from generate_series(1, 1000) g;
+insert into test_range_join_1 select int4range(g, g+100) from generate_series(1, 1000, 10) g;
+insert into test_range_join_2 select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_range_join_2 select int4range(g, g+100) from generate_series(1, 500, 10) g;
+insert into test_range_join_3 select int4range(g, g+10) from generate_series(501, 1000) g;
+insert into test_range_join_3 select int4range(g, g+100) from generate_series(501, 1000, 10) g;
+
+analyze test_range_join_1;
+analyze test_range_join_2;
+analyze test_range_join_3;
+
+-- reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 && ir2 and ir2 && ir3;
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 << ir2 and ir2 << ir3;
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 >> ir2 and ir2 >> ir3;
+
+drop table test_range_join_1;
+drop table test_range_join_2;
+drop table test_range_join_3;
+
+--
+-- test range join selectivity with fully disjoint histograms
+-- (exercises the bounds-check logic when histograms do not overlap)
+--
+create table test_range_join_lo (r int4range);
+create table test_range_join_hi (r int4range);
+
+-- low ranges: [1,11), [2,12), ... [500,510)
+insert into test_range_join_lo select int4range(g, g+10) from generate_series(1, 500) g;
+-- high ranges: [10001,10011), [10002,10012), ... [10500,10510)
+insert into test_range_join_hi select int4range(g, g+10) from generate_series(10001, 10500) g;
+
+analyze test_range_join_lo;
+analyze test_range_join_hi;
+
+-- lo << hi should produce a large selectivity (most pairs match)
+-- lo >> hi should produce a near-zero selectivity
+-- lo && hi should produce a near-zero selectivity (no overlap)
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r << b.r;
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r >> b.r;
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r && b.r;
+
+drop table test_range_join_lo;
+drop table test_range_join_hi;
-- 
2.52.0

view thread (16+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Implement missing join selectivity estimation for range types
  In-Reply-To: <CABXr29HUT2m7YPpB4gOZHphYmwpwuEWFZ_p6oeyouhWw8Zc1Tw@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox