public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Implement missing join selectivity estimation for range types
2+ messages / 1 participants
[nested] [flat]

* Re: Implement missing join selectivity estimation for range types
@ 2023-01-18 17:25  Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Tomas Vondra @ 2023-01-18 17:25 UTC (permalink / raw)
  To: Mahmoud Sakr <[email protected]>; PostgreSQL Hackers <[email protected]>; +Cc: SCHOEMANS Maxime <[email protected]>; Diogo Repas <[email protected]>; Luo Zhicheng <[email protected]>; Andrey Lepikhov <[email protected]>

Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
> SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the << predicate,
> which is equal to 0.005, not utilizing the statistics that the optimizer
> collects for range attributes.
> 
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range
> types it in this patch.
> 

Interesting. Are there any particular differences compared to how we
estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics for
> range types, which is the bounds histogram. It works fairly accurate for the
> operations <<, >>, &&, &<, &>, <=, >= with estimation error of about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics
target, which creates 100 histogram bins, each representing ~1% of the
values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the
> operations @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the range
> bounds. A point to discuss is whether or not to keep these last two operations.

That's a good question. I think the independence assumption is rather
foolish in this case, so I wonder if we could "stabilize" this by making
some different - less optimistic - assumption. Essentially, we have an
estimates for lower/upper boundaries:

  P1 = P(lower(var1) <= lower(var2))
  P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very
pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?

Another option is to use the length histogram, right? I mean, we know
what the average length is, and it should be possible to use that to
calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange types,
> treating a multirange as a single range which is its bounding box.
> 

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of scalar
> types. We, however, don't implement it for scalars, since more work is needed
> to make use of the other statistics available for scalars, such as the MCV.
> This is left as a future work.
> 

So if the column(s) contain a couple very common (multi)ranges that make
it into an MCV, we'll ignore those? That's a bit unfortunate, because
those MCV elements are potentially the main contributors to selectivity.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: Implement missing join selectivity estimation for range types
@ 2023-01-18 18:07  Tomas Vondra <[email protected]>
  parent: Tomas Vondra <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Tomas Vondra @ 2023-01-18 18:07 UTC (permalink / raw)
  To: Mahmoud Sakr <[email protected]>; PostgreSQL Hackers <[email protected]>; +Cc: SCHOEMANS Maxime <[email protected]>; Diogo Repas <[email protected]>; Luo Zhicheng <[email protected]>; Andrey Lepikhov <[email protected]>

Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".

However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






^ permalink  raw  reply  [nested|flat] 2+ messages in thread


end of thread, other threads:[~2023-01-18 18:07 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2023-01-18 17:25 Re: Implement missing join selectivity estimation for range types Tomas Vondra <[email protected]>
2023-01-18 18:07 ` Tomas Vondra <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox