public inbox for [email protected]
help / color / mirror / Atom feedFrom: SCHOEMANS Maxime <[email protected]>
To: Haibo Yan <[email protected]>
Cc: vignesh C <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: Damir Belyalov <[email protected]>
Cc: jian he <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: SAKR Mahmoud <[email protected]>
Cc: Diogo Repas <[email protected]>
Cc: Andrey Lepikhov <[email protected]>
Subject: Re: Implement missing join selectivity estimation for range types
Date: Thu, 16 Apr 2026 15:12:56 +0000
Message-ID: <AM0P190MB07243E890A7DFD6F24F70269F0232@AM0P190MB0724.EURP190.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <CABXr29HuYbrtrThYaE8GYy-wMxtej8sG4Hw+2oSfQvuM+c0XVw@mail.gmail.com>
References: <CAB4o4asMq3k6HN9WfDsssQ5DDVfAziB4TpiFJ8RBJgZTVuwC7g@mail.gmail.com>
<[email protected]>
<[email protected]>
<CAB4o4aud47V_iRyWtA8+ZAmdXDjCF165R73AeCjx2RL0nzQzHA@mail.gmail.com>
<[email protected]>
<[email protected]>
<CAB4o4asvPN=NT7KvS9zVQjZbdsiRW5t8aQctEkW7mxc4hbBxVQ@mail.gmail.com>
<[email protected]>
<[email protected]>
<[email protected]>
<CALH1Lgurg1y1DTeFOXOkpP=+X7saVGCh8gSjDLoCBOcFFWhz-A@mail.gmail.com>
<[email protected]>
<[email protected]>
<[email protected]>
<CALDaNm29NMhpPvoM49qqfYzVcRJq0CPTd-xSUcTF7RZQO2TASQ@mail.gmail.com>
<[email protected]>
<CALDaNm3A4j7=rmY128Eb3v8zi=hHAnGizUcbSX5ekZKHy-wmMA@mail.gmail.com>
<CACJufxFhES7untAPmxJDrB-y_pNgah_1HNbXjcOj6g=yxh3kdA@mail.gmail.com>
<CABXr29HUT2m7YPpB4gOZHphYmwpwuEWFZ_p6oeyouhWw8Zc1Tw@mail.gmail.com>
<DB8P190MB0731FF1A358E9AB16FB1F14FF0252@DB8P190MB0731.EURP190.PROD.OUTLOOK.COM>
<CABXr29Gts+FuMRkeA=JNAxB700W1EQW2F=W2wAjtJrGZDGTxGw@mail.gmail.com>
<DB8P190MB0731EA1B8381F48AA15B9437F0222@DB8P190MB0731.EURP190.PROD.OUTLOOK.COM>
<CABXr29HuYbrtrThYaE8GYy-wMxtej8sG4Hw+2oSfQvuM+c0XVw@mail.gmail.com>
Hi Haibo,
Attached is v7 with the changes we discussed.
Patch 2 now has an inline comment on the && case explaining the
outer-bounds approximation and its consistency with existing restriction
selectivity. The commit message mentions it as well.
Patch 3 uses a separate backend-private header (rangetypes_selfuncs.h)
instead of selfuncs.h.
Regards,
Maxime
Attachments:
[application/octet-stream] v7-0001-Improve-range-join-selectivity-estimation-for.patch (20.7K, 3-v7-0001-Improve-range-join-selectivity-estimation-for.patch)
download | inline diff:
From d19997bb7587076cc1e3fc0c81a1a06655f93bb5 Mon Sep 17 00:00:00 2001
From: Maxime Schoemans <[email protected]>
Date: Mon, 13 Apr 2026 16:04:23 +0200
Subject: [PATCH v7 1/3] Improve range join selectivity estimation for <<, >>,
&&
Teach rangejoinsel to estimate join selectivity for range operators
using bound histogram statistics instead of falling back to fixed
defaults. The estimation is based on a trapezoidal approximation of
P(X < Y) by parallel-scanning the bound histograms of both sides.
This improves planner row estimates especially when the two range
columns have clearly separated or strongly overlapping distributions.
Regression tests cover plan changes for representative range join cases.
Based on: Repas, Luo, Schoemans, Sakr (2022) "Selectivity Estimation
of Inequality Joins In Databases"
https://doi.org/10.48550/arXiv.2206.07396
---
src/backend/utils/adt/rangetypes_selfuncs.c | 300 ++++++++++++++++++++
src/include/catalog/pg_operator.dat | 6 +-
src/include/catalog/pg_proc.dat | 4 +
src/test/regress/expected/rangetypes.out | 114 ++++++++
src/test/regress/sql/rangetypes.sql | 53 ++++
5 files changed, 474 insertions(+), 3 deletions(-)
diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index 75f1e7567d5..cc702f28610 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1221,3 +1221,303 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
return sum_frac;
}
+
+/*
+ * Estimate join selectivity P(X < Y) using rangebound histograms.
+ *
+ * Based on: Diogo Repas, Zhicheng Luo, Maxime Schoemans, Mahmoud Sakr, 2022
+ * "Selectivity Estimation of Inequality Joins In Databases"
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * hist1 and hist2 are arrays of RangeBound entries from the bounds histograms
+ * of two range-typed attributes X and Y, respectively. Each array has at
+ * least 2 entries (one histogram bin). The entries carry full bound metadata
+ * (lower/upper flag, inclusive/exclusive), and all comparisons use
+ * range_cmp_bounds() so that bound semantics are preserved.
+ *
+ * The algorithm models each attribute's distribution as a piecewise function
+ * derived from its histogram, then computes:
+ * P(X < Y) = 0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
+ * by parallel-scanning both histograms.
+ *
+ * The initial fast-forward loops skip histogram entries that fall entirely
+ * before the other histogram's range, so the main loop only processes the
+ * overlapping region. Bounds checks are required because the histograms may
+ * be completely disjoint (e.g., all of X is below all of Y).
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+ const RangeBound *hist1, int nhist1,
+ const RangeBound *hist2, int nhist2)
+{
+ int i,
+ j;
+ double selectivity = 0.0;
+ double prev_sel1 = -1.0; /* negative sentinel skips first iter */
+ double prev_sel2 = 0.0;
+
+ Assert(nhist1 > 1);
+ Assert(nhist2 > 1);
+
+ /*
+ * Fast-forward past hist1 entries that are entirely below hist2[0], and
+ * vice versa. Bounds checks prevent out-of-bounds access when the
+ * histograms are fully disjoint.
+ */
+ for (i = 0; i < nhist1 &&
+ range_cmp_bounds(typcache, &hist1[i], &hist2[0]) < 0; i++)
+ ;
+ for (j = 0; j < nhist2 &&
+ range_cmp_bounds(typcache, &hist2[j], &hist1[0]) < 0; j++)
+ ;
+
+ /*
+ * Handle fully-separated histograms. When all bounds in hist1 are below
+ * all bounds in hist2, P(X < Y) is ~1.0. When all of hist2 is below
+ * hist1, P(X < Y) is ~0.0. We return immediately rather than falling
+ * into the overlap walk with invalid indices.
+ */
+ if (i >= nhist1)
+ return 1.0;
+ if (j >= nhist2)
+ return 0.0;
+
+ /* Walk the overlapping region of both histograms */
+ while (i < nhist1 && j < nhist2)
+ {
+ double cur_sel1,
+ cur_sel2;
+ RangeBound cur_sync;
+ int cmp;
+
+ cmp = range_cmp_bounds(typcache, &hist1[i], &hist2[j]);
+ if (cmp < 0)
+ cur_sync = hist1[i++];
+ else if (cmp > 0)
+ cur_sync = hist2[j++];
+ else
+ {
+ /* Equal bounds: advance both */
+ cur_sync = hist1[i];
+ i++;
+ j++;
+ }
+ cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+ hist1, nhist1, false);
+ cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+ hist2, nhist2, false);
+
+ /* Skip the first iteration (no previous point yet) */
+ if (prev_sel1 >= 0)
+ selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
+
+ prev_sel1 = cur_sel1;
+ prev_sel2 = cur_sel2;
+ }
+
+ /* P(X < Y) = 0.5 * Sum(...) */
+ selectivity /= 2;
+
+ /* Include remainder of hist2 if hist1 was exhausted first */
+ if (j < nhist2)
+ selectivity += 1 - prev_sel2;
+
+ return selectivity;
+}
+
+/*
+ * rangejoinsel -- join selectivity for range-vs-range operators
+ *
+ * Supports: <<, >>, &&
+ * These operators map directly to strict bound comparisons P(X < Y),
+ * which calc_hist_join_selectivity() estimates from bound histograms.
+ * Other range operators are left to their existing generic estimators.
+ */
+Datum
+rangejoinsel(PG_FUNCTION_ARGS)
+{
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ Oid operator = PG_GETARG_OID(1);
+ List *args = (List *) PG_GETARG_POINTER(2);
+ SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
+ VariableStatData vardata1;
+ VariableStatData vardata2;
+ Selectivity selec;
+ AttStatsSlot hist1;
+ AttStatsSlot hist2;
+ AttStatsSlot sslot;
+ bool have_hist1 = false;
+ bool have_hist2 = false;
+ TypeCacheEntry *typcache;
+ Form_pg_statistic stats1;
+ Form_pg_statistic stats2;
+ double empty_frac1;
+ double empty_frac2;
+ double null_frac1;
+ double null_frac2;
+ int nhist1;
+ int nhist2;
+ RangeBound *hist1_lower;
+ RangeBound *hist1_upper;
+ RangeBound *hist2_lower;
+ RangeBound *hist2_upper;
+ bool join_is_reversed;
+ bool empty;
+ int i;
+
+ get_join_variables(root, args, sjinfo, &vardata1, &vardata2,
+ &join_is_reversed);
+
+ selec = default_range_selectivity(operator);
+
+ /*
+ * Acquire histogram stats for both sides. Each slot is tracked
+ * independently so we can release exactly what was acquired on any
+ * failure path.
+ */
+ if (!HeapTupleIsValid(vardata1.statsTuple) ||
+ !HeapTupleIsValid(vardata2.statsTuple))
+ goto cleanup;
+
+ if (vardata1.vartype != vardata2.vartype)
+ goto cleanup;
+
+ memset(&hist1, 0, sizeof(hist1));
+ memset(&hist2, 0, sizeof(hist2));
+
+ if (!get_attstatsslot(&hist1, vardata1.statsTuple,
+ STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+ ATTSTATSSLOT_VALUES))
+ goto cleanup;
+ have_hist1 = true;
+
+ if (!get_attstatsslot(&hist2, vardata2.statsTuple,
+ STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+ ATTSTATSSLOT_VALUES))
+ goto cleanup;
+ have_hist2 = true;
+
+ /* Initialize type cache */
+ typcache = range_get_typcache(fcinfo, vardata1.vartype);
+
+ /* Look up NULL and empty-range fractions */
+ stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
+ stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
+
+ null_frac1 = stats1->stanullfrac;
+ null_frac2 = stats2->stanullfrac;
+
+ /* Try to get fraction of empty ranges for the first variable */
+ if (get_attstatsslot(&sslot, vardata1.statsTuple,
+ STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+ InvalidOid, ATTSTATSSLOT_NUMBERS))
+ {
+ if (sslot.nnumbers != 1)
+ elog(ERROR, "invalid empty fraction statistic");
+ empty_frac1 = sslot.numbers[0];
+ free_attstatsslot(&sslot);
+ }
+ else
+ {
+ empty_frac1 = 0.0;
+ }
+
+ /* Try to get fraction of empty ranges for the second variable */
+ if (get_attstatsslot(&sslot, vardata2.statsTuple,
+ STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+ InvalidOid, ATTSTATSSLOT_NUMBERS))
+ {
+ if (sslot.nnumbers != 1)
+ elog(ERROR, "invalid empty fraction statistic");
+ empty_frac2 = sslot.numbers[0];
+ free_attstatsslot(&sslot);
+ }
+ else
+ {
+ empty_frac2 = 0.0;
+ }
+
+ /* Convert range histograms to separate lower/upper bound arrays */
+ nhist1 = hist1.nvalues;
+ hist1_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+ hist1_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+ for (i = 0; i < nhist1; i++)
+ {
+ range_deserialize(typcache, DatumGetRangeTypeP(hist1.values[i]),
+ &hist1_lower[i], &hist1_upper[i], &empty);
+ if (empty)
+ elog(ERROR, "bounds histogram contains an empty range");
+ }
+
+ nhist2 = hist2.nvalues;
+ hist2_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+ hist2_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+ for (i = 0; i < nhist2; i++)
+ {
+ range_deserialize(typcache, DatumGetRangeTypeP(hist2.values[i]),
+ &hist2_lower[i], &hist2_upper[i], &empty);
+ if (empty)
+ elog(ERROR, "bounds histogram contains an empty range");
+ }
+
+ /* Estimate selectivity based on the operator */
+ switch (operator)
+ {
+ case OID_RANGE_OVERLAP_OP:
+
+ /*
+ * A && B iff NOT(A << B) AND NOT(A >> B) = 1 - P(A.upper <
+ * B.lower) - P(B.upper < A.lower)
+ */
+ selec = 1;
+ selec -= calc_hist_join_selectivity(typcache,
+ hist1_upper, nhist1,
+ hist2_lower, nhist2);
+ selec -= calc_hist_join_selectivity(typcache,
+ hist2_upper, nhist2,
+ hist1_lower, nhist1);
+ break;
+
+ case OID_RANGE_LEFT_OP:
+ /* A << B iff upper(A) < lower(B) */
+ selec = calc_hist_join_selectivity(typcache,
+ hist1_upper, nhist1,
+ hist2_lower, nhist2);
+ break;
+
+ case OID_RANGE_RIGHT_OP:
+ /* A >> B iff upper(B) < lower(A) */
+ selec = calc_hist_join_selectivity(typcache,
+ hist2_upper, nhist2,
+ hist1_lower, nhist1);
+ break;
+
+ default:
+ /* Unsupported operator; keep the default selectivity */
+ goto cleanup;
+ }
+
+ /* The histogram-based selectivity applies to non-empty ranges only */
+ selec *= (1 - empty_frac1) * (1 - empty_frac2);
+
+ /*
+ * For the supported operators (<<, >>, &&), empty ranges always produce
+ * false, so no empty-fraction adjustment is needed.
+ */
+
+ /* All range operators are strict */
+ selec *= (1 - null_frac1) * (1 - null_frac2);
+
+cleanup:
+ if (have_hist2)
+ free_attstatsslot(&hist2);
+ if (have_hist1)
+ free_attstatsslot(&hist1);
+
+ ReleaseVariableStats(vardata1);
+ ReleaseVariableStats(vardata2);
+
+ CLAMP_PROBABILITY(selec);
+
+ PG_RETURN_FLOAT8((float8) selec);
+}
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index 1465f13120a..5ea4434f9fa 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -3094,7 +3094,7 @@
oprname => '&&', oprleft => 'anyrange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '&&(anyrange,anyrange)',
oprcode => 'range_overlaps', oprrest => 'rangesel',
- oprjoin => 'areajoinsel' },
+ oprjoin => 'rangejoinsel' },
{ oid => '3889', oid_symbol => 'OID_RANGE_CONTAINS_ELEM_OP',
descr => 'contains',
oprname => '@>', oprleft => 'anyrange', oprright => 'anyelement',
@@ -3122,12 +3122,12 @@
oprname => '<<', oprleft => 'anyrange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '>>(anyrange,anyrange)',
oprcode => 'range_before', oprrest => 'rangesel',
- oprjoin => 'scalarltjoinsel' },
+ oprjoin => 'rangejoinsel' },
{ oid => '3894', oid_symbol => 'OID_RANGE_RIGHT_OP', descr => 'is right of',
oprname => '>>', oprleft => 'anyrange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '<<(anyrange,anyrange)',
oprcode => 'range_after', oprrest => 'rangesel',
- oprjoin => 'scalargtjoinsel' },
+ oprjoin => 'rangejoinsel' },
{ oid => '3895', oid_symbol => 'OID_RANGE_OVERLAPS_LEFT_OP',
descr => 'overlaps or is left of',
oprname => '&<', oprleft => 'anyrange', oprright => 'anyrange',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 99fa9a6ede2..c6a707acae4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12919,4 +12919,8 @@
proname => 'hashoid8extended', prorettype => 'int8',
proargtypes => 'oid8 int8', prosrc => 'hashoid8extended' },
+{ oid => '8355', descr => 'join selectivity for range operators',
+ proname => 'rangejoinsel', provolatile => 's', prorettype => 'float8',
+ proargtypes => 'internal oid internal int2 internal',
+ prosrc => 'rangejoinsel' },
]
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index e062a4e5c2c..2fc5b770f90 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -2033,3 +2033,117 @@ select * from text_support_test where t <@ textrange_supp('a', 'd');
drop table text_support_test;
drop type textrange_supp;
+--
+-- test selectivity of range join operators
+--
+create table test_range_join_1 (ir1 int4range);
+create table test_range_join_2 (ir2 int4range);
+create table test_range_join_3 (ir3 int4range);
+insert into test_range_join_1 select int4range(g, g+10) from generate_series(1, 1000) g;
+insert into test_range_join_1 select int4range(g, g+100) from generate_series(1, 1000, 10) g;
+insert into test_range_join_2 select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_range_join_2 select int4range(g, g+100) from generate_series(1, 500, 10) g;
+insert into test_range_join_3 select int4range(g, g+10) from generate_series(501, 1000) g;
+insert into test_range_join_3 select int4range(g, g+100) from generate_series(501, 1000, 10) g;
+analyze test_range_join_1;
+analyze test_range_join_2;
+analyze test_range_join_3;
+-- reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 && ir2 and ir2 && ir3;
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_range_join_1.ir1 && test_range_join_2.ir2)
+ -> Seq Scan on test_range_join_1
+ -> Materialize
+ -> Nested Loop
+ Join Filter: (test_range_join_2.ir2 && test_range_join_3.ir3)
+ -> Seq Scan on test_range_join_2
+ -> Materialize
+ -> Seq Scan on test_range_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 << ir2 and ir2 << ir3;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_range_join_2.ir2 << test_range_join_3.ir3)
+ -> Nested Loop
+ Join Filter: (test_range_join_1.ir1 << test_range_join_2.ir2)
+ -> Seq Scan on test_range_join_1
+ -> Materialize
+ -> Seq Scan on test_range_join_2
+ -> Materialize
+ -> Seq Scan on test_range_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 >> ir2 and ir2 >> ir3;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_range_join_1.ir1 >> test_range_join_2.ir2)
+ -> Nested Loop
+ Join Filter: (test_range_join_2.ir2 >> test_range_join_3.ir3)
+ -> Seq Scan on test_range_join_2
+ -> Materialize
+ -> Seq Scan on test_range_join_3
+ -> Seq Scan on test_range_join_1
+(9 rows)
+
+drop table test_range_join_1;
+drop table test_range_join_2;
+drop table test_range_join_3;
+--
+-- test range join selectivity with fully disjoint histograms
+-- (exercises the bounds-check logic when histograms do not overlap)
+--
+create table test_range_join_lo (r int4range);
+create table test_range_join_hi (r int4range);
+-- low ranges: [1,11), [2,12), ... [500,510)
+insert into test_range_join_lo select int4range(g, g+10) from generate_series(1, 500) g;
+-- high ranges: [10001,10011), [10002,10012), ... [10500,10510)
+insert into test_range_join_hi select int4range(g, g+10) from generate_series(10001, 10500) g;
+analyze test_range_join_lo;
+analyze test_range_join_hi;
+-- lo << hi should produce a large selectivity (most pairs match)
+-- lo >> hi should produce a near-zero selectivity
+-- lo && hi should produce a near-zero selectivity (no overlap)
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r << b.r;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r << b.r)
+ -> Seq Scan on test_range_join_lo a
+ -> Materialize
+ -> Seq Scan on test_range_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r >> b.r;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r >> b.r)
+ -> Seq Scan on test_range_join_lo a
+ -> Materialize
+ -> Seq Scan on test_range_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r && b.r;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r && b.r)
+ -> Seq Scan on test_range_join_lo a
+ -> Materialize
+ -> Seq Scan on test_range_join_hi b
+(6 rows)
+
+drop table test_range_join_lo;
+drop table test_range_join_hi;
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index 5c4b0337b7a..f69109da334 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -708,3 +708,56 @@ select * from text_support_test where t <@ textrange_supp('a', 'd');
drop table text_support_test;
drop type textrange_supp;
+
+--
+-- test selectivity of range join operators
+--
+create table test_range_join_1 (ir1 int4range);
+create table test_range_join_2 (ir2 int4range);
+create table test_range_join_3 (ir3 int4range);
+
+insert into test_range_join_1 select int4range(g, g+10) from generate_series(1, 1000) g;
+insert into test_range_join_1 select int4range(g, g+100) from generate_series(1, 1000, 10) g;
+insert into test_range_join_2 select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_range_join_2 select int4range(g, g+100) from generate_series(1, 500, 10) g;
+insert into test_range_join_3 select int4range(g, g+10) from generate_series(501, 1000) g;
+insert into test_range_join_3 select int4range(g, g+100) from generate_series(501, 1000, 10) g;
+
+analyze test_range_join_1;
+analyze test_range_join_2;
+analyze test_range_join_3;
+
+-- reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 && ir2 and ir2 && ir3;
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 << ir2 and ir2 << ir3;
+explain (costs off) select count(*) from test_range_join_1, test_range_join_2, test_range_join_3 where ir1 >> ir2 and ir2 >> ir3;
+
+drop table test_range_join_1;
+drop table test_range_join_2;
+drop table test_range_join_3;
+
+--
+-- test range join selectivity with fully disjoint histograms
+-- (exercises the bounds-check logic when histograms do not overlap)
+--
+create table test_range_join_lo (r int4range);
+create table test_range_join_hi (r int4range);
+
+-- low ranges: [1,11), [2,12), ... [500,510)
+insert into test_range_join_lo select int4range(g, g+10) from generate_series(1, 500) g;
+-- high ranges: [10001,10011), [10002,10012), ... [10500,10510)
+insert into test_range_join_hi select int4range(g, g+10) from generate_series(10001, 10500) g;
+
+analyze test_range_join_lo;
+analyze test_range_join_hi;
+
+-- lo << hi should produce a large selectivity (most pairs match)
+-- lo >> hi should produce a near-zero selectivity
+-- lo && hi should produce a near-zero selectivity (no overlap)
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r << b.r;
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r >> b.r;
+explain (costs off) select count(*) from test_range_join_lo a, test_range_join_hi b where a.r && b.r;
+
+drop table test_range_join_lo;
+drop table test_range_join_hi;
--
2.50.1 (Apple Git-155)
[application/octet-stream] v7-0002-Improve-multirange-join-selectivity-estimation-fo.patch (26.6K, 4-v7-0002-Improve-multirange-join-selectivity-estimation-fo.patch)
download | inline diff:
From deb986653ba56f0580b03484e3f4a22716bac9f7 Mon Sep 17 00:00:00 2001
From: Maxime Schoemans <[email protected]>
Date: Mon, 13 Apr 2026 16:06:03 +0200
Subject: [PATCH v7 2/3] Improve multirange join selectivity estimation for <<,
>>, &&
Add multirangejoinsel() to estimate join selectivity for multirange
operators using bound histograms, covering all type combinations:
multirange vs multirange, multirange vs range, range vs multirange.
Note that multirange statistics only represent the outermost bounds
(see multirange_typanalyze), so && may overestimate overlap for sparse
multiranges. This is consistent with how existing restriction
selectivity handles multirange &&.
The shared helper functions (calc_hist_join_selectivity and others) are
intentionally duplicated from rangetypes_selfuncs.c for reviewability.
A follow-up commit will remove the duplication.
---
.../utils/adt/multirangetypes_selfuncs.c | 325 ++++++++++++++++++
src/include/catalog/pg_operator.dat | 18 +-
src/include/catalog/pg_proc.dat | 4 +
src/test/regress/expected/multirangetypes.out | 157 +++++++++
src/test/regress/sql/multirangetypes.sql | 72 ++++
5 files changed, 567 insertions(+), 9 deletions(-)
diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 533111445e7..241f8c6dbe0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1334,3 +1334,328 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
return sum_frac;
}
+
+/*
+ * Estimate join selectivity P(X < Y) using rangebound histograms.
+ *
+ * Based on: Diogo Repas, Zhicheng Luo, Maxime Schoemans, Mahmoud Sakr, 2022
+ * "Selectivity Estimation of Inequality Joins In Databases"
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * hist1 and hist2 are arrays of RangeBound entries from the bounds histograms
+ * of two range-typed or multirange-typed attributes X and Y, respectively.
+ * Each array has at least 2 entries (one histogram bin). The entries carry
+ * full bound metadata (lower/upper flag, inclusive/exclusive), and all
+ * comparisons use range_cmp_bounds() so that bound semantics are preserved.
+ *
+ * The algorithm models each attribute's distribution as a piecewise function
+ * derived from its histogram, then computes:
+ * P(X < Y) = 0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
+ * by parallel-scanning both histograms.
+ *
+ * The initial fast-forward loops skip histogram entries that fall entirely
+ * before the other histogram's range, so the main loop only processes the
+ * overlapping region. Bounds checks are required because the histograms may
+ * be completely disjoint (e.g., all of X is below all of Y).
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+ const RangeBound *hist1, int nhist1,
+ const RangeBound *hist2, int nhist2)
+{
+ int i,
+ j;
+ double selectivity = 0.0;
+ double prev_sel1 = -1.0; /* negative sentinel skips first iter */
+ double prev_sel2 = 0.0;
+
+ Assert(nhist1 > 1);
+ Assert(nhist2 > 1);
+
+ /*
+ * Fast-forward past hist1 entries that are entirely below hist2[0], and
+ * vice versa. Bounds checks prevent out-of-bounds access when the
+ * histograms are fully disjoint.
+ */
+ for (i = 0; i < nhist1 &&
+ range_cmp_bounds(typcache, &hist1[i], &hist2[0]) < 0; i++)
+ ;
+ for (j = 0; j < nhist2 &&
+ range_cmp_bounds(typcache, &hist2[j], &hist1[0]) < 0; j++)
+ ;
+
+ /*
+ * Handle fully-separated histograms. When all bounds in hist1 are below
+ * all bounds in hist2, P(X < Y) is ~1.0. When all of hist2 is below
+ * hist1, P(X < Y) is ~0.0. We return immediately rather than falling
+ * into the overlap walk with invalid indices.
+ */
+ if (i >= nhist1)
+ return 1.0;
+ if (j >= nhist2)
+ return 0.0;
+
+ /* Walk the overlapping region of both histograms */
+ while (i < nhist1 && j < nhist2)
+ {
+ double cur_sel1,
+ cur_sel2;
+ RangeBound cur_sync;
+ int cmp;
+
+ cmp = range_cmp_bounds(typcache, &hist1[i], &hist2[j]);
+ if (cmp < 0)
+ cur_sync = hist1[i++];
+ else if (cmp > 0)
+ cur_sync = hist2[j++];
+ else
+ {
+ /* Equal bounds: advance both */
+ cur_sync = hist1[i];
+ i++;
+ j++;
+ }
+ cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+ hist1, nhist1, false);
+ cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+ hist2, nhist2, false);
+
+ /* Skip the first iteration (no previous point yet) */
+ if (prev_sel1 >= 0)
+ selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
+
+ prev_sel1 = cur_sel1;
+ prev_sel2 = cur_sel2;
+ }
+
+ /* P(X < Y) = 0.5 * Sum(...) */
+ selectivity /= 2;
+
+ /* Include remainder of hist2 if hist1 was exhausted first */
+ if (j < nhist2)
+ selectivity += 1 - prev_sel2;
+
+ return selectivity;
+}
+
+/*
+ * multirangejoinsel -- join selectivity for multirange operators
+ *
+ * Supports: <<, >>, && for all type combinations:
+ * multirange vs multirange, multirange vs range, range vs multirange
+ *
+ * These operators map directly to strict bound comparisons P(X < Y),
+ * which calc_hist_join_selectivity() estimates from bound histograms.
+ * Both range and multirange types store bound histograms in the same
+ * format, so the estimation is identical regardless of type combination.
+ */
+Datum
+multirangejoinsel(PG_FUNCTION_ARGS)
+{
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ Oid operator = PG_GETARG_OID(1);
+ List *args = (List *) PG_GETARG_POINTER(2);
+ SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
+ VariableStatData vardata1;
+ VariableStatData vardata2;
+ Selectivity selec;
+ AttStatsSlot hist1;
+ AttStatsSlot hist2;
+ AttStatsSlot sslot;
+ bool have_hist1 = false;
+ bool have_hist2 = false;
+ TypeCacheEntry *typcache;
+ TypeCacheEntry *rng_typcache;
+ Form_pg_statistic stats1;
+ Form_pg_statistic stats2;
+ double empty_frac1;
+ double empty_frac2;
+ double null_frac1;
+ double null_frac2;
+ int nhist1;
+ int nhist2;
+ RangeBound *hist1_lower;
+ RangeBound *hist1_upper;
+ RangeBound *hist2_lower;
+ RangeBound *hist2_upper;
+ bool join_is_reversed;
+ bool empty;
+ int i;
+
+ get_join_variables(root, args, sjinfo, &vardata1, &vardata2,
+ &join_is_reversed);
+
+ selec = default_multirange_selectivity(operator);
+
+ /*
+ * Acquire histogram stats for both sides. Each slot is tracked
+ * independently so we can release exactly what was acquired on any
+ * failure path.
+ */
+ if (!HeapTupleIsValid(vardata1.statsTuple) ||
+ !HeapTupleIsValid(vardata2.statsTuple))
+ goto cleanup;
+
+ memset(&hist1, 0, sizeof(hist1));
+ memset(&hist2, 0, sizeof(hist2));
+
+ if (!get_attstatsslot(&hist1, vardata1.statsTuple,
+ STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+ ATTSTATSSLOT_VALUES))
+ goto cleanup;
+ have_hist1 = true;
+
+ if (!get_attstatsslot(&hist2, vardata2.statsTuple,
+ STATISTIC_KIND_BOUNDS_HISTOGRAM, InvalidOid,
+ ATTSTATSSLOT_VALUES))
+ goto cleanup;
+ have_hist2 = true;
+
+ /*
+ * Determine the range type cache for bound comparisons. At least one
+ * side is a multirange type; try vardata1 first, then vardata2.
+ */
+ typcache = lookup_type_cache(vardata1.vartype, TYPECACHE_MULTIRANGE_INFO);
+ if (typcache->rngtype != NULL)
+ rng_typcache = typcache->rngtype;
+ else
+ {
+ typcache = lookup_type_cache(vardata2.vartype,
+ TYPECACHE_MULTIRANGE_INFO);
+ rng_typcache = typcache->rngtype;
+ }
+
+ /* Look up NULL and empty-range fractions */
+ stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
+ stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
+
+ null_frac1 = stats1->stanullfrac;
+ null_frac2 = stats2->stanullfrac;
+
+ /* Try to get fraction of empty ranges for the first variable */
+ if (get_attstatsslot(&sslot, vardata1.statsTuple,
+ STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+ InvalidOid, ATTSTATSSLOT_NUMBERS))
+ {
+ if (sslot.nnumbers != 1)
+ elog(ERROR, "invalid empty fraction statistic");
+ empty_frac1 = sslot.numbers[0];
+ free_attstatsslot(&sslot);
+ }
+ else
+ {
+ empty_frac1 = 0.0;
+ }
+
+ /* Try to get fraction of empty ranges for the second variable */
+ if (get_attstatsslot(&sslot, vardata2.statsTuple,
+ STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM,
+ InvalidOid, ATTSTATSSLOT_NUMBERS))
+ {
+ if (sslot.nnumbers != 1)
+ elog(ERROR, "invalid empty fraction statistic");
+ empty_frac2 = sslot.numbers[0];
+ free_attstatsslot(&sslot);
+ }
+ else
+ {
+ empty_frac2 = 0.0;
+ }
+
+ /* Convert range histograms to separate lower/upper bound arrays */
+ nhist1 = hist1.nvalues;
+ hist1_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+ hist1_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist1);
+ for (i = 0; i < nhist1; i++)
+ {
+ range_deserialize(rng_typcache, DatumGetRangeTypeP(hist1.values[i]),
+ &hist1_lower[i], &hist1_upper[i], &empty);
+ if (empty)
+ elog(ERROR, "bounds histogram contains an empty range");
+ }
+
+ nhist2 = hist2.nvalues;
+ hist2_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+ hist2_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist2);
+ for (i = 0; i < nhist2; i++)
+ {
+ range_deserialize(rng_typcache, DatumGetRangeTypeP(hist2.values[i]),
+ &hist2_lower[i], &hist2_upper[i], &empty);
+ if (empty)
+ elog(ERROR, "bounds histogram contains an empty range");
+ }
+
+ /* Estimate selectivity based on the operator */
+ switch (operator)
+ {
+ case OID_RANGE_OVERLAPS_MULTIRANGE_OP:
+ case OID_MULTIRANGE_OVERLAPS_RANGE_OP:
+ case OID_MULTIRANGE_OVERLAPS_MULTIRANGE_OP:
+
+ /*
+ * A && B iff NOT(A << B) AND NOT(A >> B) = 1 - P(A.upper <
+ * B.lower) - P(B.upper < A.lower)
+ *
+ * This decomposition is exact for single ranges. For
+ * multiranges, the bound histograms only represent the outermost
+ * lower and upper bounds (see multirange_typanalyze), so internal
+ * gaps are not captured. This can overestimate overlap for sparse
+ * multiranges, but is consistent with how existing restriction
+ * selectivity handles multirange &&.
+ */
+ selec = 1;
+ selec -= calc_hist_join_selectivity(rng_typcache,
+ hist1_upper, nhist1,
+ hist2_lower, nhist2);
+ selec -= calc_hist_join_selectivity(rng_typcache,
+ hist2_upper, nhist2,
+ hist1_lower, nhist1);
+ break;
+
+ case OID_RANGE_LEFT_MULTIRANGE_OP:
+ case OID_MULTIRANGE_LEFT_RANGE_OP:
+ case OID_MULTIRANGE_LEFT_MULTIRANGE_OP:
+ /* A << B iff upper(A) < lower(B) */
+ selec = calc_hist_join_selectivity(rng_typcache,
+ hist1_upper, nhist1,
+ hist2_lower, nhist2);
+ break;
+
+ case OID_RANGE_RIGHT_MULTIRANGE_OP:
+ case OID_MULTIRANGE_RIGHT_RANGE_OP:
+ case OID_MULTIRANGE_RIGHT_MULTIRANGE_OP:
+ /* A >> B iff upper(B) < lower(A) */
+ selec = calc_hist_join_selectivity(rng_typcache,
+ hist2_upper, nhist2,
+ hist1_lower, nhist1);
+ break;
+
+ default:
+ /* Unsupported operator; keep the default selectivity */
+ goto cleanup;
+ }
+
+ /* The histogram-based selectivity applies to non-empty ranges only */
+ selec *= (1 - empty_frac1) * (1 - empty_frac2);
+
+ /*
+ * For the supported operators (<<, >>, &&), empty ranges always produce
+ * false, so no empty-fraction adjustment is needed.
+ */
+
+ /* All multirange operators are strict */
+ selec *= (1 - null_frac1) * (1 - null_frac2);
+
+cleanup:
+ if (have_hist2)
+ free_attstatsslot(&hist2);
+ if (have_hist1)
+ free_attstatsslot(&hist1);
+
+ ReleaseVariableStats(vardata1);
+ ReleaseVariableStats(vardata2);
+
+ CLAMP_PROBABILITY(selec);
+
+ PG_RETURN_FLOAT8((float8) selec);
+}
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index 5ea4434f9fa..28f696a9f41 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -3302,19 +3302,19 @@
oprname => '&&', oprleft => 'anyrange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '&&(anymultirange,anyrange)',
oprcode => 'range_overlaps_multirange', oprrest => 'multirangesel',
- oprjoin => 'areajoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '2867', oid_symbol => 'OID_MULTIRANGE_OVERLAPS_RANGE_OP',
descr => 'overlaps',
oprname => '&&', oprleft => 'anymultirange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '&&(anyrange,anymultirange)',
oprcode => 'multirange_overlaps_range', oprrest => 'multirangesel',
- oprjoin => 'areajoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '2868', oid_symbol => 'OID_MULTIRANGE_OVERLAPS_MULTIRANGE_OP',
descr => 'overlaps',
oprname => '&&', oprleft => 'anymultirange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '&&(anymultirange,anymultirange)',
oprcode => 'multirange_overlaps_multirange', oprrest => 'multirangesel',
- oprjoin => 'areajoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '2869', oid_symbol => 'OID_MULTIRANGE_CONTAINS_ELEM_OP',
descr => 'contains',
oprname => '@>', oprleft => 'anymultirange', oprright => 'anyelement',
@@ -3428,37 +3428,37 @@
oprname => '<<', oprleft => 'anyrange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '>>(anymultirange,anyrange)',
oprcode => 'range_before_multirange', oprrest => 'multirangesel',
- oprjoin => 'scalarltjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '4396', oid_symbol => 'OID_MULTIRANGE_LEFT_RANGE_OP',
descr => 'is left of',
oprname => '<<', oprleft => 'anymultirange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '>>(anyrange,anymultirange)',
oprcode => 'multirange_before_range', oprrest => 'multirangesel',
- oprjoin => 'scalarltjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '4397', oid_symbol => 'OID_MULTIRANGE_LEFT_MULTIRANGE_OP',
descr => 'is left of',
oprname => '<<', oprleft => 'anymultirange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '>>(anymultirange,anymultirange)',
oprcode => 'multirange_before_multirange', oprrest => 'multirangesel',
- oprjoin => 'scalarltjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '4398', oid_symbol => 'OID_RANGE_RIGHT_MULTIRANGE_OP',
descr => 'is right of',
oprname => '>>', oprleft => 'anyrange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '<<(anymultirange,anyrange)',
oprcode => 'range_after_multirange', oprrest => 'multirangesel',
- oprjoin => 'scalargtjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '4399', oid_symbol => 'OID_MULTIRANGE_RIGHT_RANGE_OP',
descr => 'is right of',
oprname => '>>', oprleft => 'anymultirange', oprright => 'anyrange',
oprresult => 'bool', oprcom => '<<(anyrange,anymultirange)',
oprcode => 'multirange_after_range', oprrest => 'multirangesel',
- oprjoin => 'scalargtjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '4400', oid_symbol => 'OID_MULTIRANGE_RIGHT_MULTIRANGE_OP',
descr => 'is right of',
oprname => '>>', oprleft => 'anymultirange', oprright => 'anymultirange',
oprresult => 'bool', oprcom => '<<(anymultirange,anymultirange)',
oprcode => 'multirange_after_multirange', oprrest => 'multirangesel',
- oprjoin => 'scalargtjoinsel' },
+ oprjoin => 'multirangejoinsel' },
{ oid => '8262', descr => 'equal',
oprname => '=', oprcanmerge => 't', oprcanhash => 't', oprleft => 'oid8',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c6a707acae4..10fbc22c4a6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12923,4 +12923,8 @@
proname => 'rangejoinsel', provolatile => 's', prorettype => 'float8',
proargtypes => 'internal oid internal int2 internal',
prosrc => 'rangejoinsel' },
+{ oid => '8356', descr => 'join selectivity for multirange operators',
+ proname => 'multirangejoinsel', provolatile => 's', prorettype => 'float8',
+ proargtypes => 'internal oid internal int2 internal',
+ prosrc => 'multirangejoinsel' },
]
diff --git a/src/test/regress/expected/multirangetypes.out b/src/test/regress/expected/multirangetypes.out
index f5e7df8df43..aab9c5e2604 100644
--- a/src/test/regress/expected/multirangetypes.out
+++ b/src/test/regress/expected/multirangetypes.out
@@ -3512,3 +3512,160 @@ create function mr_table_fail(i anyelement) returns table(i anyelement, r anymul
as $$ select $1, '[1,10]' $$ language sql;
ERROR: cannot determine result data type
DETAIL: A result of type anymultirange requires at least one input of type anyrange or anymultirange.
+-- Restore GUCs changed by earlier index tests
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- test selectivity of multirange join operators
+--
+create table test_mr_join_1 (mr1 int4multirange);
+create table test_mr_join_2 (mr2 int4multirange);
+create table test_mr_join_3 (mr3 int4multirange);
+insert into test_mr_join_1 select int4multirange(int4range(g, g+10)) from generate_series(1, 1000) g;
+insert into test_mr_join_1 select int4multirange(int4range(g, g+100)) from generate_series(1, 1000, 10) g;
+insert into test_mr_join_2 select int4multirange(int4range(g, g+10)) from generate_series(1, 500) g;
+insert into test_mr_join_2 select int4multirange(int4range(g, g+100)) from generate_series(1, 500, 10) g;
+insert into test_mr_join_3 select int4multirange(int4range(g, g+10)) from generate_series(501, 1000) g;
+insert into test_mr_join_3 select int4multirange(int4range(g, g+100)) from generate_series(501, 1000, 10) g;
+analyze test_mr_join_1;
+analyze test_mr_join_2;
+analyze test_mr_join_3;
+-- multirange vs multirange: reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 && mr2 and mr2 && mr3;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_mr_join_1.mr1 && test_mr_join_2.mr2)
+ -> Seq Scan on test_mr_join_1
+ -> Materialize
+ -> Nested Loop
+ Join Filter: (test_mr_join_2.mr2 && test_mr_join_3.mr3)
+ -> Seq Scan on test_mr_join_2
+ -> Materialize
+ -> Seq Scan on test_mr_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 << mr2 and mr2 << mr3;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_mr_join_2.mr2 << test_mr_join_3.mr3)
+ -> Nested Loop
+ Join Filter: (test_mr_join_1.mr1 << test_mr_join_2.mr2)
+ -> Seq Scan on test_mr_join_1
+ -> Materialize
+ -> Seq Scan on test_mr_join_2
+ -> Materialize
+ -> Seq Scan on test_mr_join_3
+(10 rows)
+
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 >> mr2 and mr2 >> mr3;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (test_mr_join_1.mr1 >> test_mr_join_2.mr2)
+ -> Nested Loop
+ Join Filter: (test_mr_join_2.mr2 >> test_mr_join_3.mr3)
+ -> Seq Scan on test_mr_join_2
+ -> Materialize
+ -> Seq Scan on test_mr_join_3
+ -> Seq Scan on test_mr_join_1
+(9 rows)
+
+drop table test_mr_join_1;
+drop table test_mr_join_2;
+drop table test_mr_join_3;
+--
+-- test multirange join selectivity with fully disjoint histograms
+--
+create table test_mr_join_lo (r int4multirange);
+create table test_mr_join_hi (r int4multirange);
+insert into test_mr_join_lo select int4multirange(int4range(g, g+10)) from generate_series(1, 500) g;
+insert into test_mr_join_hi select int4multirange(int4range(g, g+10)) from generate_series(10001, 10500) g;
+analyze test_mr_join_lo;
+analyze test_mr_join_hi;
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r << b.r;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r << b.r)
+ -> Seq Scan on test_mr_join_lo a
+ -> Materialize
+ -> Seq Scan on test_mr_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r >> b.r;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r >> b.r)
+ -> Seq Scan on test_mr_join_lo a
+ -> Materialize
+ -> Seq Scan on test_mr_join_hi b
+(6 rows)
+
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r && b.r;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r && b.r)
+ -> Seq Scan on test_mr_join_lo a
+ -> Materialize
+ -> Seq Scan on test_mr_join_hi b
+(6 rows)
+
+drop table test_mr_join_lo;
+drop table test_mr_join_hi;
+--
+-- test range vs multirange join selectivity
+--
+create table test_mr_join_r (r int4range);
+create table test_mr_join_mr (mr int4multirange);
+insert into test_mr_join_r select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_mr_join_mr select int4multirange(int4range(g, g+10)) from generate_series(10001, 10500) g;
+analyze test_mr_join_r;
+analyze test_mr_join_mr;
+-- range vs multirange operators should use multirangejoinsel
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r << b.mr;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r << b.mr)
+ -> Seq Scan on test_mr_join_r a
+ -> Materialize
+ -> Seq Scan on test_mr_join_mr b
+(6 rows)
+
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r >> b.mr;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r >> b.mr)
+ -> Seq Scan on test_mr_join_r a
+ -> Materialize
+ -> Seq Scan on test_mr_join_mr b
+(6 rows)
+
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r && b.mr;
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ Join Filter: (a.r && b.mr)
+ -> Seq Scan on test_mr_join_r a
+ -> Materialize
+ -> Seq Scan on test_mr_join_mr b
+(6 rows)
+
+drop table test_mr_join_r;
+drop table test_mr_join_mr;
diff --git a/src/test/regress/sql/multirangetypes.sql b/src/test/regress/sql/multirangetypes.sql
index 112334b03eb..e3f8cd6f4e3 100644
--- a/src/test/regress/sql/multirangetypes.sql
+++ b/src/test/regress/sql/multirangetypes.sql
@@ -904,3 +904,75 @@ create function mr_inoutparam_fail(inout i anyelement, out r anymultirange)
--should fail
create function mr_table_fail(i anyelement) returns table(i anyelement, r anymultirange)
as $$ select $1, '[1,10]' $$ language sql;
+
+-- Restore GUCs changed by earlier index tests
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+
+--
+-- test selectivity of multirange join operators
+--
+create table test_mr_join_1 (mr1 int4multirange);
+create table test_mr_join_2 (mr2 int4multirange);
+create table test_mr_join_3 (mr3 int4multirange);
+
+insert into test_mr_join_1 select int4multirange(int4range(g, g+10)) from generate_series(1, 1000) g;
+insert into test_mr_join_1 select int4multirange(int4range(g, g+100)) from generate_series(1, 1000, 10) g;
+insert into test_mr_join_2 select int4multirange(int4range(g, g+10)) from generate_series(1, 500) g;
+insert into test_mr_join_2 select int4multirange(int4range(g, g+100)) from generate_series(1, 500, 10) g;
+insert into test_mr_join_3 select int4multirange(int4range(g, g+10)) from generate_series(501, 1000) g;
+insert into test_mr_join_3 select int4multirange(int4range(g, g+100)) from generate_series(501, 1000, 10) g;
+
+analyze test_mr_join_1;
+analyze test_mr_join_2;
+analyze test_mr_join_3;
+
+-- multirange vs multirange: reorder joins based on computed selectivity
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 && mr2 and mr2 && mr3;
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 << mr2 and mr2 << mr3;
+explain (costs off) select count(*) from test_mr_join_1, test_mr_join_2, test_mr_join_3 where mr1 >> mr2 and mr2 >> mr3;
+
+drop table test_mr_join_1;
+drop table test_mr_join_2;
+drop table test_mr_join_3;
+
+--
+-- test multirange join selectivity with fully disjoint histograms
+--
+create table test_mr_join_lo (r int4multirange);
+create table test_mr_join_hi (r int4multirange);
+
+insert into test_mr_join_lo select int4multirange(int4range(g, g+10)) from generate_series(1, 500) g;
+insert into test_mr_join_hi select int4multirange(int4range(g, g+10)) from generate_series(10001, 10500) g;
+
+analyze test_mr_join_lo;
+analyze test_mr_join_hi;
+
+-- These should not crash and should produce stable plans.
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r << b.r;
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r >> b.r;
+explain (costs off) select count(*) from test_mr_join_lo a, test_mr_join_hi b where a.r && b.r;
+
+drop table test_mr_join_lo;
+drop table test_mr_join_hi;
+
+--
+-- test range vs multirange join selectivity
+--
+create table test_mr_join_r (r int4range);
+create table test_mr_join_mr (mr int4multirange);
+
+insert into test_mr_join_r select int4range(g, g+10) from generate_series(1, 500) g;
+insert into test_mr_join_mr select int4multirange(int4range(g, g+10)) from generate_series(10001, 10500) g;
+
+analyze test_mr_join_r;
+analyze test_mr_join_mr;
+
+-- range vs multirange operators should use multirangejoinsel
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r << b.mr;
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r >> b.mr;
+explain (costs off) select count(*) from test_mr_join_r a, test_mr_join_mr b where a.r && b.mr;
+
+drop table test_mr_join_r;
+drop table test_mr_join_mr;
--
2.50.1 (Apple Git-155)
[application/octet-stream] v7-0003-Remove-duplicate-selectivity-functions-between-ra.patch (34.5K, 5-v7-0003-Remove-duplicate-selectivity-functions-between-ra.patch)
download | inline diff:
From 12665ca19f802a2ace50525cf5df8a6f95e860df Mon Sep 17 00:00:00 2001
From: Maxime Schoemans <[email protected]>
Date: Thu, 16 Apr 2026 16:28:17 +0200
Subject: [PATCH v7 3/3] Remove duplicate selectivity functions between range
and multirange
The multirange selectivity code duplicated 10 helper functions from
rangetypes_selfuncs.c. Since both range and multirange types use the
same histogram format (STATISTIC_KIND_BOUNDS_HISTOGRAM) and the same
RangeBound representation, the functions are identical.
Make the 10 shared functions non-static in rangetypes_selfuncs.c,
export them via a new rangetypes_selfuncs.h header, and remove the
copies from multirangetypes_selfuncs.c.
---
.../utils/adt/multirangetypes_selfuncs.c | 772 +-----------------
src/backend/utils/adt/rangetypes_selfuncs.c | 46 +-
src/include/utils/rangetypes_selfuncs.h | 54 ++
3 files changed, 67 insertions(+), 805 deletions(-)
create mode 100644 src/include/utils/rangetypes_selfuncs.h
diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 241f8c6dbe0..fa5f23d09a9 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -27,6 +27,7 @@
#include "utils/lsyscache.h"
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
+#include "utils/rangetypes_selfuncs.h"
#include "utils/selfuncs.h"
#include "utils/typcache.h"
@@ -38,37 +39,6 @@ static double calc_hist_selectivity(TypeCacheEntry *typcache,
VariableStatData *vardata,
const MultirangeType *constval,
Oid operator);
-static double calc_hist_selectivity_scalar(TypeCacheEntry *typcache,
- const RangeBound *constbound,
- const RangeBound *hist,
- int hist_nvalues, bool equal);
-static int rbound_bsearch(TypeCacheEntry *typcache, const RangeBound *value,
- const RangeBound *hist, int hist_length, bool equal);
-static float8 get_position(TypeCacheEntry *typcache, const RangeBound *value,
- const RangeBound *hist1, const RangeBound *hist2);
-static float8 get_len_position(double value, double hist1, double hist2);
-static float8 get_distance(TypeCacheEntry *typcache, const RangeBound *bound1,
- const RangeBound *bound2);
-static int length_hist_bsearch(const Datum *length_hist_values,
- int length_hist_nvalues, double value,
- bool equal);
-static double calc_length_hist_frac(const Datum *length_hist_values,
- int length_hist_nvalues, double length1,
- double length2, bool equal);
-static double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
- const RangeBound *lower,
- RangeBound *upper,
- const RangeBound *hist_lower,
- int hist_nvalues,
- const Datum *length_hist_values,
- int length_hist_nvalues);
-static double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
- const RangeBound *lower,
- const RangeBound *upper,
- const RangeBound *hist_lower,
- int hist_nvalues,
- const Datum *length_hist_values,
- int length_hist_nvalues);
/*
* Returns a default selectivity estimate for given operator, when we don't
@@ -698,746 +668,6 @@ calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
return hist_selec;
}
-
-/*
- * Look up the fraction of values less than (or equal, if 'equal' argument
- * is true) a given const in a histogram of range bounds.
- */
-static double
-calc_hist_selectivity_scalar(TypeCacheEntry *typcache, const RangeBound *constbound,
- const RangeBound *hist, int hist_nvalues, bool equal)
-{
- Selectivity selec;
- int index;
-
- /*
- * Find the histogram bin the given constant falls into. Estimate
- * selectivity as the number of preceding whole bins.
- */
- index = rbound_bsearch(typcache, constbound, hist, hist_nvalues, equal);
- selec = (Selectivity) (Max(index, 0)) / (Selectivity) (hist_nvalues - 1);
-
- /* Adjust using linear interpolation within the bin */
- if (index >= 0 && index < hist_nvalues - 1)
- selec += get_position(typcache, constbound, &hist[index],
- &hist[index + 1]) / (Selectivity) (hist_nvalues - 1);
-
- return selec;
-}
-
-/*
- * Binary search on an array of range bounds. Returns greatest index of range
- * bound in array which is less(less or equal) than given range bound. If all
- * range bounds in array are greater or equal(greater) than given range bound,
- * return -1. When "equal" flag is set conditions in brackets are used.
- *
- * This function is used in scalar operator selectivity estimation. Another
- * goal of this function is to find a histogram bin where to stop
- * interpolation of portion of bounds which are less than or equal to given bound.
- */
-static int
-rbound_bsearch(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound *hist,
- int hist_length, bool equal)
-{
- int lower = -1,
- upper = hist_length - 1,
- cmp,
- middle;
-
- while (lower < upper)
- {
- middle = (lower + upper + 1) / 2;
- cmp = range_cmp_bounds(typcache, &hist[middle], value);
-
- if (cmp < 0 || (equal && cmp == 0))
- lower = middle;
- else
- upper = middle - 1;
- }
- return lower;
-}
-
-
-/*
- * Binary search on length histogram. Returns greatest index of range length in
- * histogram which is less than (less than or equal) the given length value. If
- * all lengths in the histogram are greater than (greater than or equal) the
- * given length, returns -1.
- */
-static int
-length_hist_bsearch(const Datum *length_hist_values, int length_hist_nvalues,
- double value, bool equal)
-{
- int lower = -1,
- upper = length_hist_nvalues - 1,
- middle;
-
- while (lower < upper)
- {
- double middleval;
-
- middle = (lower + upper + 1) / 2;
-
- middleval = DatumGetFloat8(length_hist_values[middle]);
- if (middleval < value || (equal && middleval <= value))
- lower = middle;
- else
- upper = middle - 1;
- }
- return lower;
-}
-
-/*
- * Get relative position of value in histogram bin in [0,1] range.
- */
-static float8
-get_position(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound *hist1,
- const RangeBound *hist2)
-{
- bool has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
- float8 position;
-
- if (!hist1->infinite && !hist2->infinite)
- {
- float8 bin_width;
-
- /*
- * Both bounds are finite. Assuming the subtype's comparison function
- * works sanely, the value must be finite, too, because it lies
- * somewhere between the bounds. If it doesn't, arbitrarily return
- * 0.5.
- */
- if (value->infinite)
- return 0.5;
-
- /* Can't interpolate without subdiff function */
- if (!has_subdiff)
- return 0.5;
-
- /* Calculate relative position using subdiff function. */
- bin_width = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
- typcache->rng_collation,
- hist2->val,
- hist1->val));
- if (isnan(bin_width) || bin_width <= 0.0)
- return 0.5; /* punt for NaN or zero-width bin */
-
- position = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
- typcache->rng_collation,
- value->val,
- hist1->val))
- / bin_width;
-
- if (isnan(position))
- return 0.5; /* punt for NaN from subdiff, Inf/Inf, etc */
-
- /* Relative position must be in [0,1] range */
- position = Max(position, 0.0);
- position = Min(position, 1.0);
- return position;
- }
- else if (hist1->infinite && !hist2->infinite)
- {
- /*
- * Lower bin boundary is -infinite, upper is finite. If the value is
- * -infinite, return 0.0 to indicate it's equal to the lower bound.
- * Otherwise return 1.0 to indicate it's infinitely far from the lower
- * bound.
- */
- return ((value->infinite && value->lower) ? 0.0 : 1.0);
- }
- else if (!hist1->infinite && hist2->infinite)
- {
- /* same as above, but in reverse */
- return ((value->infinite && !value->lower) ? 1.0 : 0.0);
- }
- else
- {
- /*
- * If both bin boundaries are infinite, they should be equal to each
- * other, and the value should also be infinite and equal to both
- * bounds. (But don't Assert that, to avoid crashing if a user creates
- * a datatype with a broken comparison function).
- *
- * Assume the value to lie in the middle of the infinite bounds.
- */
- return 0.5;
- }
-}
-
-
-/*
- * Get relative position of value in a length histogram bin in [0,1] range.
- */
-static double
-get_len_position(double value, double hist1, double hist2)
-{
- if (!isinf(hist1) && !isinf(hist2))
- {
- /*
- * Both bounds are finite. The value should be finite too, because it
- * lies somewhere between the bounds. If it doesn't, just return
- * something.
- */
- if (isinf(value))
- return 0.5;
-
- return 1.0 - (hist2 - value) / (hist2 - hist1);
- }
- else if (isinf(hist1) && !isinf(hist2))
- {
- /*
- * Lower bin boundary is -infinite, upper is finite. Return 1.0 to
- * indicate the value is infinitely far from the lower bound.
- */
- return 1.0;
- }
- else if (isinf(hist1) && isinf(hist2))
- {
- /* same as above, but in reverse */
- return 0.0;
- }
- else
- {
- /*
- * If both bin boundaries are infinite, they should be equal to each
- * other, and the value should also be infinite and equal to both
- * bounds. (But don't Assert that, to avoid crashing unnecessarily if
- * the caller messes up)
- *
- * Assume the value to lie in the middle of the infinite bounds.
- */
- return 0.5;
- }
-}
-
-/*
- * Measure distance between two range bounds.
- */
-static float8
-get_distance(TypeCacheEntry *typcache, const RangeBound *bound1, const RangeBound *bound2)
-{
- bool has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
-
- if (!bound1->infinite && !bound2->infinite)
- {
- /*
- * Neither bound is infinite, use subdiff function or return default
- * value of 1.0 if no subdiff is available.
- */
- if (has_subdiff)
- {
- float8 res;
-
- res = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
- typcache->rng_collation,
- bound2->val,
- bound1->val));
- /* Reject possible NaN result, also negative result */
- if (isnan(res) || res < 0.0)
- return 1.0;
- else
- return res;
- }
- else
- return 1.0;
- }
- else if (bound1->infinite && bound2->infinite)
- {
- /* Both bounds are infinite */
- if (bound1->lower == bound2->lower)
- return 0.0;
- else
- return get_float8_infinity();
- }
- else
- {
- /* One bound is infinite, the other is not */
- return get_float8_infinity();
- }
-}
-
-/*
- * Calculate the average of function P(x), in the interval [length1, length2],
- * where P(x) is the fraction of tuples with length < x (or length <= x if
- * 'equal' is true).
- */
-static double
-calc_length_hist_frac(const Datum *length_hist_values, int length_hist_nvalues,
- double length1, double length2, bool equal)
-{
- double frac;
- double A,
- B,
- PA,
- PB;
- double pos;
- int i;
- double area;
-
- Assert(length2 >= length1);
-
- if (length2 < 0.0)
- return 0.0; /* shouldn't happen, but doesn't hurt to check */
-
- /* All lengths in the table are <= infinite. */
- if (isinf(length2) && equal)
- return 1.0;
-
- /*----------
- * The average of a function between A and B can be calculated by the
- * formula:
- *
- * B
- * 1 /
- * ------- | P(x)dx
- * B - A /
- * A
- *
- * The geometrical interpretation of the integral is the area under the
- * graph of P(x). P(x) is defined by the length histogram. We calculate
- * the area in a piecewise fashion, iterating through the length histogram
- * bins. Each bin is a trapezoid:
- *
- * P(x2)
- * /|
- * / |
- * P(x1)/ |
- * | |
- * | |
- * ---+---+--
- * x1 x2
- *
- * where x1 and x2 are the boundaries of the current histogram, and P(x1)
- * and P(x1) are the cumulative fraction of tuples at the boundaries.
- *
- * The area of each trapezoid is 1/2 * (P(x2) + P(x1)) * (x2 - x1)
- *
- * The first bin contains the lower bound passed by the caller, so we
- * use linear interpolation between the previous and next histogram bin
- * boundary to calculate P(x1). Likewise for the last bin: we use linear
- * interpolation to calculate P(x2). For the bins in between, x1 and x2
- * lie on histogram bin boundaries, so P(x1) and P(x2) are simply:
- * P(x1) = (bin index) / (number of bins)
- * P(x2) = (bin index + 1 / (number of bins)
- */
-
- /* First bin, the one that contains lower bound */
- i = length_hist_bsearch(length_hist_values, length_hist_nvalues, length1, equal);
- if (i >= length_hist_nvalues - 1)
- return 1.0;
-
- if (i < 0)
- {
- i = 0;
- pos = 0.0;
- }
- else
- {
- /* interpolate length1's position in the bin */
- pos = get_len_position(length1,
- DatumGetFloat8(length_hist_values[i]),
- DatumGetFloat8(length_hist_values[i + 1]));
- }
- PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
- B = length1;
-
- /*
- * In the degenerate case that length1 == length2, simply return
- * P(length1). This is not merely an optimization: if length1 == length2,
- * we'd divide by zero later on.
- */
- if (length2 == length1)
- return PB;
-
- /*
- * Loop through all the bins, until we hit the last bin, the one that
- * contains the upper bound. (if lower and upper bounds are in the same
- * bin, this falls out immediately)
- */
- area = 0.0;
- for (; i < length_hist_nvalues - 1; i++)
- {
- double bin_upper = DatumGetFloat8(length_hist_values[i + 1]);
-
- /* check if we've reached the last bin */
- if (!(bin_upper < length2 || (equal && bin_upper <= length2)))
- break;
-
- /* the upper bound of previous bin is the lower bound of this bin */
- A = B;
- PA = PB;
-
- B = bin_upper;
- PB = (double) i / (double) (length_hist_nvalues - 1);
-
- /*
- * Add the area of this trapezoid to the total. The point of the
- * if-check is to avoid NaN, in the corner case that PA == PB == 0,
- * and B - A == Inf. The area of a zero-height trapezoid (PA == PB ==
- * 0) is zero, regardless of the width (B - A).
- */
- if (PA > 0 || PB > 0)
- area += 0.5 * (PB + PA) * (B - A);
- }
-
- /* Last bin */
- A = B;
- PA = PB;
-
- B = length2; /* last bin ends at the query upper bound */
- if (i >= length_hist_nvalues - 1)
- pos = 0.0;
- else
- {
- if (DatumGetFloat8(length_hist_values[i]) == DatumGetFloat8(length_hist_values[i + 1]))
- pos = 0.0;
- else
- pos = get_len_position(length2,
- DatumGetFloat8(length_hist_values[i]),
- DatumGetFloat8(length_hist_values[i + 1]));
- }
- PB = (((double) i) + pos) / (double) (length_hist_nvalues - 1);
-
- if (PA > 0 || PB > 0)
- area += 0.5 * (PB + PA) * (B - A);
-
- /*
- * Ok, we have calculated the area, ie. the integral. Divide by width to
- * get the requested average.
- *
- * Avoid NaN arising from infinite / infinite. This happens at least if
- * length2 is infinite. It's not clear what the correct value would be in
- * that case, so 0.5 seems as good as any value.
- */
- if (isinf(area) && isinf(length2))
- frac = 0.5;
- else
- frac = area / (length2 - length1);
-
- return frac;
-}
-
-/*
- * Calculate selectivity of "var <@ const" operator, ie. estimate the fraction
- * of multiranges that fall within the constant lower and upper bounds. This uses
- * the histograms of range lower bounds and range lengths, on the assumption
- * that the range lengths are independent of the lower bounds.
- *
- * The caller has already checked that constant lower and upper bounds are
- * finite.
- */
-static double
-calc_hist_selectivity_contained(TypeCacheEntry *typcache,
- const RangeBound *lower, RangeBound *upper,
- const RangeBound *hist_lower, int hist_nvalues,
- const Datum *length_hist_values, int length_hist_nvalues)
-{
- int i,
- upper_index;
- float8 prev_dist;
- double bin_width;
- double upper_bin_width;
- double sum_frac;
-
- /*
- * Begin by finding the bin containing the upper bound, in the lower bound
- * histogram. Any range with a lower bound > constant upper bound can't
- * match, ie. there are no matches in bins greater than upper_index.
- */
- upper->inclusive = !upper->inclusive;
- upper->lower = true;
- upper_index = rbound_bsearch(typcache, upper, hist_lower, hist_nvalues,
- false);
-
- /*
- * If the upper bound value is below the histogram's lower limit, there
- * are no matches.
- */
- if (upper_index < 0)
- return 0.0;
-
- /*
- * If the upper bound value is at or beyond the histogram's upper limit,
- * start our loop at the last actual bin, as though the upper bound were
- * within that bin; get_position will clamp its result to 1.0 anyway.
- * (This corresponds to assuming that the data population above the
- * histogram's upper limit is empty, exactly like what we just assumed for
- * the lower limit.)
- */
- upper_index = Min(upper_index, hist_nvalues - 2);
-
- /*
- * Calculate upper_bin_width, ie. the fraction of the (upper_index,
- * upper_index + 1) bin which is greater than upper bound of query range
- * using linear interpolation of subdiff function.
- */
- upper_bin_width = get_position(typcache, upper,
- &hist_lower[upper_index],
- &hist_lower[upper_index + 1]);
-
- /*
- * In the loop, dist and prev_dist are the distance of the "current" bin's
- * lower and upper bounds from the constant upper bound.
- *
- * bin_width represents the width of the current bin. Normally it is 1.0,
- * meaning a full width bin, but can be less in the corner cases: start
- * and end of the loop. We start with bin_width = upper_bin_width, because
- * we begin at the bin containing the upper bound.
- */
- prev_dist = 0.0;
- bin_width = upper_bin_width;
-
- sum_frac = 0.0;
- for (i = upper_index; i >= 0; i--)
- {
- double dist;
- double length_hist_frac;
- bool final_bin = false;
-
- /*
- * dist -- distance from upper bound of query range to lower bound of
- * the current bin in the lower bound histogram. Or to the lower bound
- * of the constant range, if this is the final bin, containing the
- * constant lower bound.
- */
- if (range_cmp_bounds(typcache, &hist_lower[i], lower) < 0)
- {
- dist = get_distance(typcache, lower, upper);
-
- /*
- * Subtract from bin_width the portion of this bin that we want to
- * ignore.
- */
- bin_width -= get_position(typcache, lower, &hist_lower[i],
- &hist_lower[i + 1]);
- if (bin_width < 0.0)
- bin_width = 0.0;
- final_bin = true;
- }
- else
- dist = get_distance(typcache, &hist_lower[i], upper);
-
- /*
- * Estimate the fraction of tuples in this bin that are narrow enough
- * to not exceed the distance to the upper bound of the query range.
- */
- length_hist_frac = calc_length_hist_frac(length_hist_values,
- length_hist_nvalues,
- prev_dist, dist, true);
-
- /*
- * Add the fraction of tuples in this bin, with a suitable length, to
- * the total.
- */
- sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
-
- if (final_bin)
- break;
-
- bin_width = 1.0;
- prev_dist = dist;
- }
-
- return sum_frac;
-}
-
-/*
- * Calculate selectivity of "var @> const" operator, ie. estimate the fraction
- * of multiranges that contain the constant lower and upper bounds. This uses
- * the histograms of range lower bounds and range lengths, on the assumption
- * that the range lengths are independent of the lower bounds.
- */
-static double
-calc_hist_selectivity_contains(TypeCacheEntry *typcache,
- const RangeBound *lower, const RangeBound *upper,
- const RangeBound *hist_lower, int hist_nvalues,
- const Datum *length_hist_values, int length_hist_nvalues)
-{
- int i,
- lower_index;
- double bin_width,
- lower_bin_width;
- double sum_frac;
- float8 prev_dist;
-
- /* Find the bin containing the lower bound of query range. */
- lower_index = rbound_bsearch(typcache, lower, hist_lower, hist_nvalues,
- true);
-
- /*
- * If the lower bound value is below the histogram's lower limit, there
- * are no matches.
- */
- if (lower_index < 0)
- return 0.0;
-
- /*
- * If the lower bound value is at or beyond the histogram's upper limit,
- * start our loop at the last actual bin, as though the upper bound were
- * within that bin; get_position will clamp its result to 1.0 anyway.
- * (This corresponds to assuming that the data population above the
- * histogram's upper limit is empty, exactly like what we just assumed for
- * the lower limit.)
- */
- lower_index = Min(lower_index, hist_nvalues - 2);
-
- /*
- * Calculate lower_bin_width, ie. the fraction of the of (lower_index,
- * lower_index + 1) bin which is greater than lower bound of query range
- * using linear interpolation of subdiff function.
- */
- lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
- &hist_lower[lower_index + 1]);
-
- /*
- * Loop through all the lower bound bins, smaller than the query lower
- * bound. In the loop, dist and prev_dist are the distance of the
- * "current" bin's lower and upper bounds from the constant upper bound.
- * We begin from query lower bound, and walk backwards, so the first bin's
- * upper bound is the query lower bound, and its distance to the query
- * upper bound is the length of the query range.
- *
- * bin_width represents the width of the current bin. Normally it is 1.0,
- * meaning a full width bin, except for the first bin, which is only
- * counted up to the constant lower bound.
- */
- prev_dist = get_distance(typcache, lower, upper);
- sum_frac = 0.0;
- bin_width = lower_bin_width;
- for (i = lower_index; i >= 0; i--)
- {
- float8 dist;
- double length_hist_frac;
-
- /*
- * dist -- distance from upper bound of query range to current value
- * of lower bound histogram or lower bound of query range (if we've
- * reach it).
- */
- dist = get_distance(typcache, &hist_lower[i], upper);
-
- /*
- * Get average fraction of length histogram which covers intervals
- * longer than (or equal to) distance to upper bound of query range.
- */
- length_hist_frac =
- 1.0 - calc_length_hist_frac(length_hist_values,
- length_hist_nvalues,
- prev_dist, dist, false);
-
- sum_frac += length_hist_frac * bin_width / (double) (hist_nvalues - 1);
-
- bin_width = 1.0;
- prev_dist = dist;
- }
-
- return sum_frac;
-}
-
-/*
- * Estimate join selectivity P(X < Y) using rangebound histograms.
- *
- * Based on: Diogo Repas, Zhicheng Luo, Maxime Schoemans, Mahmoud Sakr, 2022
- * "Selectivity Estimation of Inequality Joins In Databases"
- * https://doi.org/10.48550/arXiv.2206.07396
- *
- * hist1 and hist2 are arrays of RangeBound entries from the bounds histograms
- * of two range-typed or multirange-typed attributes X and Y, respectively.
- * Each array has at least 2 entries (one histogram bin). The entries carry
- * full bound metadata (lower/upper flag, inclusive/exclusive), and all
- * comparisons use range_cmp_bounds() so that bound semantics are preserved.
- *
- * The algorithm models each attribute's distribution as a piecewise function
- * derived from its histogram, then computes:
- * P(X < Y) = 0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
- * by parallel-scanning both histograms.
- *
- * The initial fast-forward loops skip histogram entries that fall entirely
- * before the other histogram's range, so the main loop only processes the
- * overlapping region. Bounds checks are required because the histograms may
- * be completely disjoint (e.g., all of X is below all of Y).
- */
-static double
-calc_hist_join_selectivity(TypeCacheEntry *typcache,
- const RangeBound *hist1, int nhist1,
- const RangeBound *hist2, int nhist2)
-{
- int i,
- j;
- double selectivity = 0.0;
- double prev_sel1 = -1.0; /* negative sentinel skips first iter */
- double prev_sel2 = 0.0;
-
- Assert(nhist1 > 1);
- Assert(nhist2 > 1);
-
- /*
- * Fast-forward past hist1 entries that are entirely below hist2[0], and
- * vice versa. Bounds checks prevent out-of-bounds access when the
- * histograms are fully disjoint.
- */
- for (i = 0; i < nhist1 &&
- range_cmp_bounds(typcache, &hist1[i], &hist2[0]) < 0; i++)
- ;
- for (j = 0; j < nhist2 &&
- range_cmp_bounds(typcache, &hist2[j], &hist1[0]) < 0; j++)
- ;
-
- /*
- * Handle fully-separated histograms. When all bounds in hist1 are below
- * all bounds in hist2, P(X < Y) is ~1.0. When all of hist2 is below
- * hist1, P(X < Y) is ~0.0. We return immediately rather than falling
- * into the overlap walk with invalid indices.
- */
- if (i >= nhist1)
- return 1.0;
- if (j >= nhist2)
- return 0.0;
-
- /* Walk the overlapping region of both histograms */
- while (i < nhist1 && j < nhist2)
- {
- double cur_sel1,
- cur_sel2;
- RangeBound cur_sync;
- int cmp;
-
- cmp = range_cmp_bounds(typcache, &hist1[i], &hist2[j]);
- if (cmp < 0)
- cur_sync = hist1[i++];
- else if (cmp > 0)
- cur_sync = hist2[j++];
- else
- {
- /* Equal bounds: advance both */
- cur_sync = hist1[i];
- i++;
- j++;
- }
- cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
- hist1, nhist1, false);
- cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
- hist2, nhist2, false);
-
- /* Skip the first iteration (no previous point yet) */
- if (prev_sel1 >= 0)
- selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
-
- prev_sel1 = cur_sel1;
- prev_sel2 = cur_sel2;
- }
-
- /* P(X < Y) = 0.5 * Sum(...) */
- selectivity /= 2;
-
- /* Include remainder of hist2 if hist1 was exhausted first */
- if (j < nhist2)
- selectivity += 1 - prev_sel2;
-
- return selectivity;
-}
-
/*
* multirangejoinsel -- join selectivity for multirange operators
*
diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index cc702f28610..4f4baa7dc1a 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -26,6 +26,7 @@
#include "utils/fmgrprotos.h"
#include "utils/lsyscache.h"
#include "utils/rangetypes.h"
+#include "utils/rangetypes_selfuncs.h"
#include "utils/selfuncs.h"
#include "utils/typcache.h"
@@ -35,29 +36,6 @@ static double default_range_selectivity(Oid operator);
static double calc_hist_selectivity(TypeCacheEntry *typcache,
VariableStatData *vardata, const RangeType *constval,
Oid operator);
-static double calc_hist_selectivity_scalar(TypeCacheEntry *typcache,
- const RangeBound *constbound,
- const RangeBound *hist, int hist_nvalues,
- bool equal);
-static int rbound_bsearch(TypeCacheEntry *typcache, const RangeBound *value,
- const RangeBound *hist, int hist_length, bool equal);
-static float8 get_position(TypeCacheEntry *typcache, const RangeBound *value,
- const RangeBound *hist1, const RangeBound *hist2);
-static float8 get_len_position(double value, double hist1, double hist2);
-static float8 get_distance(TypeCacheEntry *typcache, const RangeBound *bound1,
- const RangeBound *bound2);
-static int length_hist_bsearch(const Datum *length_hist_values,
- int length_hist_nvalues, double value, bool equal);
-static double calc_length_hist_frac(const Datum *length_hist_values,
- int length_hist_nvalues, double length1, double length2, bool equal);
-static double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
- const RangeBound *lower, RangeBound *upper,
- const RangeBound *hist_lower, int hist_nvalues,
- const Datum *length_hist_values, int length_hist_nvalues);
-static double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
- const RangeBound *lower, const RangeBound *upper,
- const RangeBound *hist_lower, int hist_nvalues,
- const Datum *length_hist_values, int length_hist_nvalues);
/*
* Returns a default selectivity estimate for given operator, when we don't
@@ -592,7 +570,7 @@ calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
* Look up the fraction of values less than (or equal, if 'equal' argument
* is true) a given const in a histogram of range bounds.
*/
-static double
+double
calc_hist_selectivity_scalar(TypeCacheEntry *typcache, const RangeBound *constbound,
const RangeBound *hist, int hist_nvalues, bool equal)
{
@@ -624,7 +602,7 @@ calc_hist_selectivity_scalar(TypeCacheEntry *typcache, const RangeBound *constbo
* goal of this function is to find a histogram bin where to stop
* interpolation of portion of bounds which are less than or equal to given bound.
*/
-static int
+int
rbound_bsearch(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound *hist,
int hist_length, bool equal)
{
@@ -653,7 +631,7 @@ rbound_bsearch(TypeCacheEntry *typcache, const RangeBound *value, const RangeBou
* all lengths in the histogram are greater than (greater than or equal) the
* given length, returns -1.
*/
-static int
+int
length_hist_bsearch(const Datum *length_hist_values, int length_hist_nvalues,
double value, bool equal)
{
@@ -679,7 +657,7 @@ length_hist_bsearch(const Datum *length_hist_values, int length_hist_nvalues,
/*
* Get relative position of value in histogram bin in [0,1] range.
*/
-static float8
+float8
get_position(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound *hist1,
const RangeBound *hist2)
{
@@ -758,7 +736,7 @@ get_position(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound
/*
* Get relative position of value in a length histogram bin in [0,1] range.
*/
-static double
+double
get_len_position(double value, double hist1, double hist2)
{
if (!isinf(hist1) && !isinf(hist2))
@@ -803,7 +781,7 @@ get_len_position(double value, double hist1, double hist2)
/*
* Measure distance between two range bounds.
*/
-static float8
+float8
get_distance(TypeCacheEntry *typcache, const RangeBound *bound1, const RangeBound *bound2)
{
bool has_subdiff = OidIsValid(typcache->rng_subdiff_finfo.fn_oid);
@@ -851,7 +829,7 @@ get_distance(TypeCacheEntry *typcache, const RangeBound *bound1, const RangeBoun
* where P(x) is the fraction of tuples with length < x (or length <= x if
* 'equal' is true).
*/
-static double
+double
calc_length_hist_frac(const Datum *length_hist_values, int length_hist_nvalues,
double length1, double length2, bool equal)
{
@@ -1014,7 +992,7 @@ calc_length_hist_frac(const Datum *length_hist_values, int length_hist_nvalues,
* The caller has already checked that constant lower and upper bounds are
* finite.
*/
-static double
+double
calc_hist_selectivity_contained(TypeCacheEntry *typcache,
const RangeBound *lower, RangeBound *upper,
const RangeBound *hist_lower, int hist_nvalues,
@@ -1135,7 +1113,7 @@ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
* the histograms of range lower bounds and range lengths, on the assumption
* that the range lengths are independent of the lower bounds.
*/
-static double
+double
calc_hist_selectivity_contains(TypeCacheEntry *typcache,
const RangeBound *lower, const RangeBound *upper,
const RangeBound *hist_lower, int hist_nvalues,
@@ -1230,7 +1208,7 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
* https://doi.org/10.48550/arXiv.2206.07396
*
* hist1 and hist2 are arrays of RangeBound entries from the bounds histograms
- * of two range-typed attributes X and Y, respectively. Each array has at
+ * of two range- or multirange-typed attributes X and Y, respectively. Each array has at
* least 2 entries (one histogram bin). The entries carry full bound metadata
* (lower/upper flag, inclusive/exclusive), and all comparisons use
* range_cmp_bounds() so that bound semantics are preserved.
@@ -1245,7 +1223,7 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
* overlapping region. Bounds checks are required because the histograms may
* be completely disjoint (e.g., all of X is below all of Y).
*/
-static double
+double
calc_hist_join_selectivity(TypeCacheEntry *typcache,
const RangeBound *hist1, int nhist1,
const RangeBound *hist2, int nhist2)
diff --git a/src/include/utils/rangetypes_selfuncs.h b/src/include/utils/rangetypes_selfuncs.h
new file mode 100644
index 00000000000..be6bda9ab11
--- /dev/null
+++ b/src/include/utils/rangetypes_selfuncs.h
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * rangetypes_selfuncs.h
+ * Shared helper functions for range and multirange selectivity estimation.
+ *
+ * These functions are defined in rangetypes_selfuncs.c and used by both
+ * rangetypes_selfuncs.c and multirangetypes_selfuncs.c.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/rangetypes_selfuncs.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RANGETYPES_SELFUNCS_H
+#define RANGETYPES_SELFUNCS_H
+
+#include "utils/rangetypes.h"
+
+extern double calc_hist_selectivity_scalar(TypeCacheEntry *typcache,
+ const RangeBound *constbound,
+ const RangeBound *hist, int hist_nvalues,
+ bool equal);
+extern int rbound_bsearch(TypeCacheEntry *typcache,
+ const RangeBound *value, const RangeBound *hist,
+ int hist_length, bool equal);
+extern int length_hist_bsearch(const Datum *length_hist_values,
+ int length_hist_nvalues,
+ double value, bool equal);
+extern float8 get_position(TypeCacheEntry *typcache,
+ const RangeBound *value,
+ const RangeBound *hist1, const RangeBound *hist2);
+extern double get_len_position(double value, double hist1, double hist2);
+extern float8 get_distance(TypeCacheEntry *typcache,
+ const RangeBound *bound1, const RangeBound *bound2);
+extern double calc_length_hist_frac(const Datum *length_hist_values,
+ int length_hist_nvalues,
+ double length1, double length2, bool equal);
+extern double calc_hist_selectivity_contained(TypeCacheEntry *typcache,
+ const RangeBound *lower, RangeBound *upper,
+ const RangeBound *hist_lower, int hist_nvalues,
+ const Datum *length_hist_values,
+ int length_hist_nvalues);
+extern double calc_hist_selectivity_contains(TypeCacheEntry *typcache,
+ const RangeBound *lower, const RangeBound *upper,
+ const RangeBound *hist_lower, int hist_nvalues,
+ const Datum *length_hist_values,
+ int length_hist_nvalues);
+extern double calc_hist_join_selectivity(TypeCacheEntry *typcache,
+ const RangeBound *hist1, int nhist1,
+ const RangeBound *hist2, int nhist2);
+
+#endif /* RANGETYPES_SELFUNCS_H */
--
2.50.1 (Apple Git-155)
view thread (16+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Implement missing join selectivity estimation for range types
In-Reply-To: <AM0P190MB07243E890A7DFD6F24F70269F0232@AM0P190MB0724.EURP190.PROD.OUTLOOK.COM>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox